I'm trying to write out a csv file with Unicode characters, so I'm using the unicodecsv package. Unfortunately, I'm still getting UnicodeDecodeErrors:
# -*- coding: utf-8 -*-
import codecs
import unicodecsv
raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
encoded_contents = unicode(raw_contents, errors='replace')
with codecs.open('test.csv', 'w', 'UTF-8') as f:
    w = unicodecsv.writer(f, encoding='UTF-8')
    w.writerow(["1", encoded_contents])
This is the traceback:
Traceback (most recent call last):
  File "unicode_test.py", line 11, in <module>
    w.writerow(["1", encoded_contents])
  File "/Library/Python/2.7/site-packages/unicodecsv/__init__.py", line 83, in writerow
    self.writer.writerow(_stringify_list(row, self.encoding, self.encoding_errors))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 17: ordinal not in range(128)
I thought converting it to Unicode would be good enough, but that doesn't seem to be the case. I'd really like to understand what is happening so that I'm better prepared to handle these errors in other projects in the future.
From the traceback, it looks like I can reproduce the error like this:
>>> raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
>>> raw_contents.encode('UTF-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)
>>>
Up until now, I thought I had a decent working knowledge of handling Unicode text in Python 2.x, but this has humbled me.
You should not use codecs.open() for your file. unicodecsv wraps the csv module, which always writes byte strings to the open file object. To write those byte strings to a Unicode-aware file object such as the one returned by codecs.open(), they are first implicitly decoded, using the default ASCII codec; this is where your UnicodeDecodeError exception stems from.
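You can reproduce that implicit decode without any csv machinery at all. A minimal sketch, using nothing beyond the standard library (demo.txt is just an illustrative filename):

import codecs

# a byte string, which is exactly what the csv module hands to the file
data = u'\u201cOversized Gorilla\u201d'.encode('UTF-8')
with codecs.open('demo.txt', 'w', 'UTF-8') as f:
    # codecs must decode the bytes back to unicode before it can encode
    # them; it uses the default ASCII codec, so this raises UnicodeDecodeError
    f.write(data)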
Use a file in binary mode instead:
with open('test.csv', 'wb') as f:
    w = unicodecsv.writer(f, encoding='UTF-8')
    w.writerow(["1", encoded_contents])
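Now the byte strings produced by unicodecsv reach the file unchanged. If you want to double-check the result, read the raw bytes back and decode them yourself (a quick sanity check on the file written above):

with open('test.csv', 'rb') as f:
    # the file contains plain UTF-8 bytes, so this decodes cleanly
    print f.read().decode('UTF-8')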
Binary mode is not strictly necessary unless your data contains embedded newlines, but the csv module wants to control how newlines are written to ensure that such values are handled correctly. However, not using codecs.open() is an absolute requirement.
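For example, a field value containing a newline only round-trips correctly because binary mode leaves line-ending handling to the csv module. A short sketch (multiline.csv is a hypothetical filename):

import unicodecsv

with open('multiline.csv', 'wb') as f:
    w = unicodecsv.writer(f, encoding='UTF-8')
    # the embedded newline is preserved inside a quoted field rather
    # than being translated or splitting the row in two
    w.writerow([u'line one\nline two', u'plain'])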
The same thing happens when you call .encode() on a byte string: you already have encoded data there, so Python first implicitly decodes it (with ASCII again) to get a Unicode value it can encode.
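You can watch that implicit step happen in the interpreter, using the same raw_contents byte string from your question:

>>> raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
>>> raw_contents.decode('ascii')  # the implicit first step of .encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)
>>> raw_contents.decode('UTF-8')  # decoding with the right codec works
u'He observes an \u201cOversized Gorilla\u201d near Ashford'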