
Python csv: UnicodeDecodeError

I'm reading in a file with Python's csv module, and have Yet Another Encoding Question (sorry, there are so many on here).

In the CSV file, there are £ signs. After reading the row in and printing it, they have become \xa3.

Trying to encode them as Unicode produces a UnicodeDecodeError:

row = [unicode(x.strip()) for x in row]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0: ordinal not in range(128)

I have been reading the csv documentation and the numerous other questions about this on StackOverflow. I think that £ becoming \xa3 in ASCII means that the original CSV file is in UTF-8.

(Incidentally, is there a quick way to check the encoding of a CSV file?)

If it's in UTF-8, then shouldn't the csv module be able to cope with it? It seems to be transforming all the symbols into ASCII, even though the documentation claims it accepts UTF-8.

I've tried adding a unicode_csv_reader function as described in the csv examples, but it doesn't help.

---- EDIT -----

I should clarify one thing. I have seen this question, which looks very similar. But adding the unicode_csv_reader function defined there produces a different error instead:

yield [unicode(cell, 'utf-8') for cell in row]
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 8: unexpected code byte

So maybe my file isn't UTF8 after all? How can I tell?

AP257 asked Aug 13 '10 19:08



1 Answer

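Try using "ISO-8859-1" (Latin-1) as your encoding. It seems like you are dealing with a single-byte "extended ASCII" encoding, not UTF-8: in Latin-1 the byte 0xa3 is £, whereas UTF-8 encodes £ as the two bytes 0xc2 0xa3, which is why the utf8 codec rejects a lone 0xa3.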
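One rough way to check which encoding a file is in is to try decoding its bytes strictly with a few candidates and see which one succeeds first. The sketch below is written for Python 3 (the question uses Python 2), and the helper name guess_encoding and its candidate list are made up for illustration. Note that Latin-1 accepts every possible byte, so it only works as a last-resort fallback, not as proof that the file really is Latin-1.

```python
def guess_encoding(raw, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first candidate that decodes `raw` without error.

    Hypothetical helper: the candidate order matters, because latin-1
    never fails and therefore has to come last.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# A lone 0xa3 byte is neither ASCII nor valid UTF-8.
print(guess_encoding(b"12\xa3"))      # latin-1
# In UTF-8 the pound sign is the two-byte sequence 0xc2 0xa3.
print(guess_encoding(b"12\xc2\xa3"))  # utf-8
```

For a real file, the bytes could be read with open(path, "rb").read() and passed to the helper.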

Edit:

Here's some simple code that deals with extended ASCII:

>>> s = "La Pe\xf1a"
>>> print s
La Pe±a
>>> print s.decode("latin-1")
La Peña
>>>

Even better, dealing with the exact character that is giving you problems:

>>> s = "12\xa3"
>>> print s.decode("latin-1")
12£
>>>
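Applied to the csv module, the same idea is to decode the data as Latin-1 before the rows are parsed. This is a minimal sketch for Python 3, assuming the file really is Latin-1; the function name read_latin1_csv and the sample bytes are invented for the example.

```python
import csv
import io


def read_latin1_csv(raw_bytes, encoding="latin-1"):
    """Decode raw CSV bytes, parse them, and strip each cell."""
    # newline="" leaves line endings untouched so csv can handle them itself.
    text = io.StringIO(raw_bytes.decode(encoding), newline="")
    return [[cell.strip() for cell in row] for row in csv.reader(text)]


# Sample data with a Latin-1 pound sign (byte 0xa3) in the second row.
rows = read_latin1_csv(b"item,price\r\nwidget, \xa312\r\n")
# rows[1] is now ["widget", "£12"]
```

When reading from disk in Python 3, open(path, newline="", encoding="latin-1") can be passed to csv.reader directly. In the Python 2 code from the question, the equivalent change would be to pass 'latin-1' instead of 'utf-8' to unicode(cell, ...) inside unicode_csv_reader.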
riwalk answered Oct 24 '22 11:10
