Handling non-utf8 characters in csv in Python 3 vs Python 2

I have the following code, which reads CSV files (some containing non-UTF-8 characters). It works well in Python 2.7.x:

    import codecs
    import csv

    encodings = {'ukprocessed.csv': 'utf8',
                 'usprocessed.csv': 'utf8',
                 'uyprocessed.csv': 'latin1',
                 'arprocessed.csv': 'latin1'}

    # filepath points at one of the files above; filename is its basename
    with codecs.open(filepath, 'r') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            row = [x.decode(encodings[filename]).encode('utf8') for x in row]

However, in Python 3.4.x the tests fail with a variety of errors:

  • AttributeError: 'str' object has no attribute 'decode'
  • UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 1078: ordinal not in range(128), and so on (the first of these is reproduced below)
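
The AttributeError can be reproduced in isolation on Python 3 (a minimal example; the row data is made up):

    import csv

    # On Python 3, csv.reader yields str values, which have no decode method
    row = next(csv.reader(['S\xe3o Paulo,Brazil']))
    row[0].decode('latin1')  # AttributeError: 'str' object has no attribute 'decode'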

I have played around with specifying encoding= when opening the file, opening in binary mode ('rb'), and a number of other things, but I can't find a solution that works in both Python 2 and 3.

Does anyone have any ideas as to how I can fix this?

Thanks

asked by Mike91


2 Answers

Narrowly addressing the cause of your error:

In Py3, the x values in each row are str (the equivalent of Py2 unicode). In Py2, str and unicode were too flexible: str was both a text and a binary data type. It supported encode, sort of, by assuming the str was ASCII, decoding it, and then re-encoding it in the chosen encoding (pointless for ASCII-compatible codecs, since it would error the moment non-ASCII was encountered). For symmetry, a similarly error-prone and pointless decode was allowed on unicode: it would encode to ASCII (erroring if the unicode contained non-ASCII), then decode with the requested codec. This was a source of all sorts of misunderstandings and errors.

In Python 3, they split the types up better (a minimal sketch follows this list):

  1. str is the text type, and only has an encode method (to convert from logical characters to a specific binary encoding of said characters)
  2. bytes (and other bytes-like types) represent binary data, and only have a decode method (to convert from a specific binary encoding to logical characters)
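
A minimal sketch of that split (the values here are made up for illustration):

# Python 3: one-way conversions between the two types
text = 'ma\xf1ana'                   # str: logical characters ('mañana')
data = text.encode('utf-8')          # str -> bytes (a specific binary encoding)
assert data.decode('utf-8') == text  # bytes -> str round-trips cleanly
# text.decode(...) and data.encode(...) raise AttributeError on Py3; there is
# no implicit ASCII round-trip like Py2 had.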

Solving your problem in a portable way:

Your code requires that "pure text" types support decode (binary -> text), and as I noted, Py2 allows this in a limited sense, even though it's usually unwise. Py3 doesn't: decode-ing logical text to logical text is nonsensical, and to avoid silent misbehavior, Py3 simply doesn't provide the invalid methods. On Py2 the call works or fails depending on the content of the unicode object; you'll think your code is non-English friendly, and then it breaks the moment you actually feed it non-English text.
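
To make that concrete, here is the trap in a Python 2 session (Python 2 only; the strings are made up for illustration):

# Python 2: unicode.decode() silently encodes via the ASCII default first
u'abc'.decode('utf-8')        # "works" by accident: pure ASCII round-trips
u'ma\xf1ana'.decode('utf-8')  # UnicodeEncodeError: 'ascii' codec can't encode
                              # character u'\xf1' in position 2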

Writing csv code that must handle non-ASCII text isn't trivial if you need full portability. Here's the problem:

  1. In Python 2, you must work with str (bytes-oriented), encoded in some encoding that contains no embedded NULs, not unicode. Note: unicode happens to work by coincidence if it contains only text representable in the encoding returned by sys.getdefaultencoding(), because csv silently encodes it using that value. That value is usually 'ascii', configured at startup by the site module via sys.setdefaultencoding (which site then deletes after calling it, so you're not supposed to tweak it yourself). The code then breaks when the input contains anything that doesn't fit that encoding during csv's forcible conversion from unicode back to str. It's not just a matter of your system locale, either: on my system, with LANG=en_US.latin-1 or LANG=en_US.utf-8, Python still reports 'ascii' as sys.getdefaultencoding() (a quick check is shown after this list).
  2. In Python 3, you must work with str (text-oriented, the equivalent of Py2's unicode).
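
A quick way to see what your interpreter is using:

import sys

# Prints 'ascii' on a stock CPython 2 regardless of LANG; 'utf-8' on CPython 3
print(sys.getdefaultencoding())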

Normally, for non-csv cases, I'd recommend io.open for full compatibility between Py2.7 and Py3.x (with better performance and compatibility than codecs.open) and a purely text-based type. But in text mode, io.open (and codecs.open, for that matter) returns unicode on Py2, which can't be used with csv unless it's representable in the default encoding, so it appears to work until you feed it something the default encoding can't handle; on Py3 it returns str (fine). In binary mode, it returns str on Py2 (fine if there are no embedded NULs, though it isn't decoding for you, so you'd need to decode from str to unicode and then encode back to utf-8 str) and bytes on Py3 (which would need to be decoded to str). It's ugly.

The best solution I can give is to use io.open, but add version-dependent wrappers around the iterators produced at specific steps, to ensure the output of each iterator is in the appropriate form for the given Python version (utf-8 encoded str on Py2, str on Py3). This gives you consistent behavior, and it limits the version checks to a fixed number per file rather than one per line:

import csv
import io
import sys

encodings = {'ukprocessed.csv': 'utf8',
             'usprocessed.csv': 'utf8',
             'uyprocessed.csv': 'latin1',
             'arprocessed.csv': 'latin1'}

# io.open in text mode will return unicode on Py2, str on Py3, decoded appropriately
# newline='' prevents it from doing line ending conversions (which are csv's
# responsibility)
with io.open(filepath, encoding=encodings[filepath], newline='') as csvdata:
    if sys.version_info[0] == 2:
        # Lazily convert lines from unicode to utf-8 encoded str
        csvdata = (line.encode('utf-8') for line in csvdata)
    reader = csv.reader(csvdata)
    if sys.version_info[0] == 2:
        # Decode row values to unicode on Py2; they're already str in Py3
        reader = ([x.decode('utf-8') for x in row] for row in reader)
    for row in reader:
        # operate on row containing native text types as values that can
        # represent whole Unicode range (unicode on Py2, str on Py3)
        ...
answered by ShadowRanger


csv isn't very portable between Python 2 and 3, but unicodecsv is:

from __future__ import print_function
import io
import unicodecsv

encodings = {'ukprocessed.csv': 'utf8',
             'usprocessed.csv': 'utf8',
             'uyprocessed.csv': 'latin1',
             'arprocessed.csv': 'latin1'}

for filepath, enc in encodings.items():
    with open(filepath, 'rb') as csvfile:
        reader = unicodecsv.reader(csvfile, encoding=enc)
        for row in reader:
            print(row)

This was tested on Python 2.7, 3.3, and 3.5. It is important to open the file in 'rb' mode because unicodecsv works with byte strings, like Python 2.7's csv module; Python 3.x's csv module works with Unicode strings directly.
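Writing works the same way; here is a minimal sketch, assuming unicodecsv.writer mirrors the reader's encoding keyword (the output filename is made up):

import unicodecsv

# The file must be opened in binary mode; unicodecsv encodes each value on write
with open('output.csv', 'wb') as csvfile:
    writer = unicodecsv.writer(csvfile, encoding='utf-8')
    writer.writerow([u'ma\u00f1ana', u'caf\u00e9'])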

answered by Mark Tolonen