I have the following code which reads csv files (some containing non-UTF8 characters). It works well in Python 2.7.x:
import codecs
import csv

encodings = {'ukprocessed.csv': 'utf8',
             'usprocessed.csv': 'utf8',
             'uyprocessed.csv': 'latin1',
             'arprocessed.csv': 'latin1'}

with codecs.open(filepath, 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        row = [x.decode(encodings[filename]).encode('utf8') for x in row]
However, in Python 3.4.x the tests fail with a variety of errors.
I have played around with specifying 'encoding=' in the file open, opening as bytes with 'rb', and a number of other things, but I can't find a solution that works in both Python 2 and 3.
Does anyone have any ideas as to how I can fix this?
Thanks
In Py3, the x values in each row are str (similar to Py2 unicode). In Py2, str and unicode were too flexible, because str was both a text and binary data type; it supported encode, sort of, by assuming the str was ASCII, decoding it, then reencoding as the chosen encoding (which for ASCII compatible codecs was pointless, since it would error when non-ASCII was encountered). And for symmetry, a similarly error prone and pointless decode of unicode types was allowed; it would encode to ASCII (erroring if the unicode contained non-ASCII), then decode in the requested codec. This was a source of all sorts of misunderstandings, errors, etc.
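To make that concrete, here is a minimal Python 2 session sketch of those hidden ASCII round-trips (results shown in comments; none of this applies on Py3):

# Python 2 only: str.encode and unicode.decode both "work" via a hidden
# ASCII round-trip, then blow up as soon as non-ASCII data shows up.
s = 'abc'                  # str (binary type)
s.encode('utf8')           # really s.decode('ascii').encode('utf8') -> 'abc'

u = u'abc'                 # unicode (text type)
u.decode('utf8')           # really u.encode('ascii').decode('utf8') -> u'abc'

u'caf\xe9'.decode('utf8')  # UnicodeEncodeError: the hidden encode to ASCII
                           # fails before the requested decode even runs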
In Python 3, they split the types up better:

- str is the text type, and only has an encode method (to convert from logical characters to a specific binary encoding of said characters)
- bytes (and other bytes-like types) represent binary data, and only have a decode method (to convert from a specific binary encoding to logical characters)

Your code requires that "pure text" types support decode (binary->text), and as I noted, Py2 allows this in a limited sense, even though it's usually dumb. Py3 doesn't; decode-ing logical text to logical text is nonsensical, and to avoid silent misbehavior, Py3 doesn't provide the invalid methods (Py2 will work depending on the content of the unicode object, then fail when it's wrong; you'll think your code is non-English friendly, then it breaks when you actually use it with non-English text).
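A short Python 3 sketch of the stricter split; the AttributeError at the end is exactly what the question's row handling trips over:

# Python 3: text and binary are disjoint types with one-way conversions.
text = u'caf\xe9'            # str: logical characters
data = text.encode('utf8')   # bytes: b'caf\xc3\xa9'
assert data.decode('utf8') == text

try:
    text.decode('utf8')      # str has no decode method in Py3
except AttributeError as e:
    print(e)                 # 'str' object has no attribute 'decode'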
Writing csv code that must handle non-ASCII input isn't trivial if you need full portability. Here's the problem:

- Python 2's csv module works with str (bytes-oriented) encoded as some encoding that doesn't include embedded NULs, not unicode. Note: unicode happens to work by coincidence if it contains only text in the encoding returned by sys.getdefaultencoding(), because csv silently encodes it using that value, usually ascii, configured on startup by the site module via sys.setdefaultencoding (site deletes sys.setdefaultencoding after calling it; you're not supposed to tweak it yourself). It breaks when the input has anything that doesn't fit the default encoding during csv's forcible conversion from unicode back to str; see the sketch after this list. It's not just a matter of your system locale either: on my system, with LANG=en_US.latin-1 or LANG=en_US.utf-8, Python still returns 'ascii' as my sys.getdefaultencoding().
- Python 3's csv module works with str (text-oriented, equivalent to Py2's unicode).
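Based on the behavior described above, a Python 2 sketch of the coincidence (and where it breaks):

# Python 2 only: csv accepts unicode lines just while they happen to fit
# the default encoding (usually 'ascii'), then fails on real non-ASCII text.
import csv
import sys

print(sys.getdefaultencoding())      # typically 'ascii', regardless of LANG

print(list(csv.reader([u'a,b,c'])))  # [['a', 'b', 'c']]: pure ASCII sneaks through

list(csv.reader([u'caf\xe9,b,c']))   # UnicodeEncodeError during csv's forcible
                                     # unicode -> str conversion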
Normally, for non-csv related cases, I'd recommend using io.open to get full compatibility between Py2.7 and Py3.x (and better performance/compatibility than codecs.open) with a purely text-based type. But io.open (and codecs.open for that matter) in text mode returns unicode on Py2 (which can't be used with csv unless it's representable in the default encoding, so you'll think it works until you feed it something the default encoding can't handle) and str in Py3 (fine); in binary mode, it returns str on Py2 (fine if there are no embedded NULs, though it's not decoding for you, so you'd need to both decode from str to unicode, then encode back from unicode to utf-8 str) and bytes in Py3 (which would need to be decoded to str). It's ugly.
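A small sketch you can run on either version to see what each mode hands you (file name taken from the question):

# Runs on Py2.7 and Py3.x: shows the per-version type of a line from io.open.
import io

with io.open('ukprocessed.csv', encoding='utf8') as f:   # text mode
    print(type(next(f)))   # Py2: <type 'unicode'>  Py3: <class 'str'>

with io.open('ukprocessed.csv', 'rb') as f:              # binary mode
    print(type(next(f)))   # Py2: <type 'str'>      Py3: <class 'bytes'>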
The best solution I can give is to use io.open, but add version-dependent wrappers around the iterators produced at specific steps to ensure the output from each iterator is in the appropriate form for the given Python version (utf-8 encoded str in Py2, str in Py3), giving you consistent behavior (and limiting the version checks to a fixed number of times per file, not once per line):
import csv
import io
import sys

encodings = {'ukprocessed.csv': 'utf8',
             'usprocessed.csv': 'utf8',
             'uyprocessed.csv': 'latin1',
             'arprocessed.csv': 'latin1'}

# io.open in text mode returns unicode on Py2 and str on Py3, decoded appropriately.
# newline='' prevents it from doing line ending conversions (which are csv's
# responsibility).
with io.open(filepath, encoding=encodings[filepath], newline='') as csvdata:
    if sys.version_info[0] == 2:
        # Lazily convert lines from unicode to utf-8 encoded str
        csvdata = (line.encode('utf-8') for line in csvdata)
    reader = csv.reader(csvdata)
    if sys.version_info[0] == 2:
        # Decode row values to unicode on Py2; they're already str in Py3
        reader = ([x.decode('utf-8') for x in row] for row in reader)
    for row in reader:
        # operate on row containing native text types as values that can
        # represent the whole Unicode range (unicode on Py2, str on Py3)
        ...
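If you do this for more than one file, it may be worth factoring the pattern into a generator. unicode_csv_rows below is a hypothetical helper of mine, not part of any library, built on exactly the same steps:

import csv
import io
import sys

def unicode_csv_rows(filepath, encoding):
    """Yield csv rows as native text values (unicode on Py2, str on Py3)."""
    with io.open(filepath, encoding=encoding, newline='') as csvdata:
        if sys.version_info[0] == 2:
            # csv needs utf-8 encoded str on Py2...
            csvdata = (line.encode('utf-8') for line in csvdata)
        reader = csv.reader(csvdata)
        if sys.version_info[0] == 2:
            # ...and hands back utf-8 str that we decode to unicode
            reader = ([x.decode('utf-8') for x in row] for row in reader)
        for row in reader:
            yield row

for row in unicode_csv_rows('uyprocessed.csv', 'latin1'):
    print(row)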
csv isn't very portable between Python 2 and 3, but unicodecsv is:
from __future__ import print_function
import unicodecsv

encodings = {'ukprocessed.csv': 'utf8',
             'usprocessed.csv': 'utf8',
             'uyprocessed.csv': 'latin1',
             'arprocessed.csv': 'latin1'}

for filepath, enc in encodings.items():
    with open(filepath, 'rb') as csvfile:
        reader = unicodecsv.reader(csvfile, encoding=enc)
        for row in reader:
            print(row)
This was tested on Python 2.7, 3.3 and 3.5. It is important to open the file in 'rb' mode because unicodecsv works with byte strings, similar to Python 2.7's csv module; Python 3.x's csv module works with Unicode strings directly.
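Writing works the same way (binary mode, with unicodecsv doing the encoding); a quick round-trip sketch under the same assumptions:

import unicodecsv

# Write unicode rows; unicodecsv encodes them to the chosen byte encoding.
with open('arprocessed.csv', 'wb') as csvfile:
    writer = unicodecsv.writer(csvfile, encoding='latin1')
    writer.writerow([u'caf\xe9', u'ar'])

# Read them back as text on both Py2 and Py3.
with open('arprocessed.csv', 'rb') as csvfile:
    reader = unicodecsv.reader(csvfile, encoding='latin1')
    print(next(reader))   # [u'café', u'ar'] on Py2; ['café', 'ar'] on Py3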