Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UnicodeDecodeError when using Python 2.x unicodecsv

I'm trying to write out a csv file with Unicode characters, so I'm using the unicodecsv package. Unfortunately, I'm still getting UnicodeDecodeErrors:

# -*- coding: utf-8 -*-

import codecs
import unicodecsv

raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
encoded_contents = unicode(raw_contents, errors='replace')

with codecs.open('test.csv', 'w', 'UTF-8') as f:
    w = unicodecsv.writer(f, encoding='UTF-8')
    w.writerow(["1", encoded_contents])

This is the traceback:

Traceback (most recent call last):
  File "unicode_test.py", line 11, in <module>
    w.writerow(["1", encoded_contents])
  File "/Library/Python/2.7/site-packages/unicodecsv/__init__.py", line 83, in writerow
    self.writer.writerow(_stringify_list(row, self.encoding, self.encoding_errors))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 17: ordinal not in range(128)

I thought converting it to Unicode would be good enough, but that doesn't seem to be the case. I'd really like to understand what is happening so that I'm better prepared to handle these errors in other projects in the future.

From the traceback, it looks like I can reproduce the error like this:

>>> raw_contents = 'He observes an “Oversized Gorilla” near Ashford'
>>> raw_contents.encode('UTF-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)
>>> 

Up until now, I thought I had a decent working knowledge of working with Unicode text in Python 2.x, but this has humbled me.

like image 960
Scott Avatar asked Jul 31 '14 15:07

Scott


People also ask

What is Unicodecsv in Python?

The unicodecsv is a drop-in replacement for Python 2.7's csv module which supports unicode strings without a hassle. Supported versions are python 2.7, 3.3, 3.4, 3.5, and pypy 2.4. 0.

How do I read a UTF 8 CSV file in Python?

You can use the pandas. read_csv() and to_csv() functions to read and write a CSV file using various encodings (e.g., UTF-8, ASCII, ANSI, ISO) as defined in the encoding argument of both functions.

How do I convert a unicode character to a CSV file?

Navigate to Data → Get External Data → From Text. Navigate to the location of the CSV file you want to import. Choose the Delimited option. Set the character encoding File Origin to 65001: Unicode (UTF-8) from the drop-down list.


1 Answers

You should not use codecs.open() for your file. unicodecsv wraps the csv module, which always writes a byte string to the open file object. In order to write that byte string to a Unicode-aware file object such as returned by codecs.open(), it is implicitly decoded; this is where your UnicodeDecodeError exception stems from.

Use a file in binary mode instead:

with open('test.csv', 'wb') as f:
    w = unicodecsv.writer(f, encoding='UTF-8')
    w.writerow(["1", encoded_contents])

The binary mode is not strictly necessary unless your data contains embedded newlines, but the csv module wants to control how newlines are written to ensure that such values are handled correctly. However, not using codecs.open() is an absolute requirement.

The same thing happens when you call .encode() on a byte string; you already have encoded data there, so Python implicitly decodes to get a Unicode value to encode.

like image 172
Martijn Pieters Avatar answered Oct 23 '22 00:10

Martijn Pieters