
Reading a UTF8 CSV file with Python

Tags: python, character-encoding, csv, utf-8

I am trying to read a CSV file that contains accented characters (French and/or Spanish only) with Python. Based on the Python 2.5 documentation for the csv module (http://docs.python.org/library/csv.html), I came up with the following code to read the file, since csv.reader supports only ASCII.

    def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
        # csv.py doesn't do Unicode; encode temporarily as UTF-8:
        csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                                dialect=dialect, **kwargs)
        for row in csv_reader:
            # decode UTF-8 back to Unicode, cell by cell:
            yield [unicode(cell, 'utf-8') for cell in row]

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')

    filename = 'output.csv'
    reader = unicode_csv_reader(open(filename))
    try:
        products = []
        for field1, field2, field3 in reader:
            ...

Below is an extract of the CSV file I am trying to read:

    0665000FS10120684,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Bleu
    0665000FS10120689,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Gris
    0665000FS10120687,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Vert
    ...

Even though I try to encode/decode to UTF-8, I am still getting the following exception:

    Traceback (most recent call last):
      File ".\Test.py", line 53, in <module>
        for field1, field2, field3 in reader:
      File ".\Test.py", line 40, in unicode_csv_reader
        for row in csv_reader:
      File ".\Test.py", line 46, in utf_8_encoder
        yield line.encode('utf-8', 'ignore')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 68: ordinal not in range(128)

How do I fix this?


asked May 24 '09 15:05

Martin

2 Answers

The .encode method is applied to a Unicode string to produce a byte string, but you are calling it on a byte string instead: the wrong way round! In Python 2, calling .encode on a byte string first decodes it implicitly with the ASCII codec, which is why the traceback shows a UnicodeDecodeError at the first non-ASCII byte (0xc3). Look at the codecs module in the standard library, and codecs.open in particular, for better general solutions for reading UTF-8 encoded text files. The csv module, however, needs to be passed UTF-8 encoded bytes, and that is what you are already getting from the file, so your code can be much simpler:

    import csv

    def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]

    filename = 'da.csv'
    reader = unicode_csv_reader(open(filename))
    for field1, field2, field3 in reader:
        print field1, field2, field3
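The reason the traceback reports a decode error even though the code calls encode can be reproduced in a few lines. The following is a minimal sketch in Python 3 syntax, where the implicit ASCII decode that Python 2 performs has to be written out explicitly. The helper name implicit_py2_encode and the sample string are illustrative and not part of the original code.

```python
# Sample text modelled on the question's data; \u00e9 is the accented "e".
raw = "Appareil photo num\u00e9rique".encode("utf-8")  # bytes, as read from the file


def implicit_py2_encode(byte_string):
    """Mimic Python 2's str.encode('utf-8'): ASCII-decode first, then encode."""
    return byte_string.decode("ascii").encode("utf-8")


error = None
try:
    implicit_py2_encode(raw)
except UnicodeDecodeError as exc:
    # Fails at the first byte outside the ASCII range.
    error = exc
    print(exc)

# The right direction for bytes is decode: bytes -> text.
text = raw.decode("utf-8")
print(len(raw), len(text))
```

The byte string is one byte longer than the decoded text because the accented character takes two bytes in UTF-8, the first of which is 0xc3.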

PS: if your input data turns out NOT to be in UTF-8, but e.g. in ISO-8859-1, then you do need a "transcoding" step (if you want to keep using UTF-8 at the csv module level), of the form line.decode('whateverweirdcodec').encode('utf-8'). More likely, though, you can just put the name of your actual encoding in place of 'utf-8' in the yield line of my code above, since csv is going to be just fine with ISO-8859-* encoded byte strings.
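As a rough illustration of the transcoding step described above, here is a sketch in Python 3 syntax. The transcode helper and the sample row are hypothetical, and ISO-8859-1 stands in for whatever encoding the file actually uses.

```python
import csv
import io

# Hypothetical sample row stored as ISO-8859-1 (Latin-1) bytes, not UTF-8.
latin1_bytes = "0665000FS10120684,SD1200IS,tr\u00e9pied - Bleu\r\n".encode("iso-8859-1")


def transcode(data, source_encoding, target_encoding="utf-8"):
    """Decode bytes from the source encoding, then re-encode them."""
    return data.decode(source_encoding).encode(target_encoding)


utf8_bytes = transcode(latin1_bytes, "iso-8859-1")

# Python 3's csv module works on text, so decode before parsing.
rows = list(csv.reader(io.StringIO(utf8_bytes.decode("utf-8"), newline="")))
print(rows)
```

In Python 3 the intermediate UTF-8 bytes are not strictly needed, since the Latin-1 bytes can be decoded straight to text. The sketch keeps them only to mirror the decode-then-encode form given in the answer.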


answered Sep 19 '22 01:09

Alex Martelli

Python 2.X

There is a unicodecsv library which should solve your problems, with the added benefit of not having to write any new csv-related code.

Here is an example from their readme:

    >>> import unicodecsv
    >>> from cStringIO import StringIO
    >>> f = StringIO()
    >>> w = unicodecsv.writer(f, encoding='utf-8')
    >>> w.writerow((u'é', u'ñ'))
    >>> f.seek(0)
    >>> r = unicodecsv.reader(f, encoding='utf-8')
    >>> row = r.next()
    >>> print row[0], row[1]
    é ñ

Python 3.X

In Python 3 this is supported out of the box by the built-in csv module. See this example:

    import csv

    with open('some.csv', newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        for row in reader:
            print(row)
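Applied to data shaped like the question's file, the Python 3 approach might look like the sketch below. The sample rows are shortened from the question's extract, and the temporary file exists only to keep the example self-contained.

```python
import csv
import os
import tempfile

# Two rows modelled on the question's extract; \u00e9 is the accented "e".
sample = (
    "0665000FS10120684,SD1200IS,Appareil photo num\u00e9rique avec tr\u00e9pied - Bleu\n"
    "0665000FS10120689,SD1200IS,Appareil photo num\u00e9rique avec tr\u00e9pied - Gris\n"
)

# Write the sample to a temporary file as UTF-8.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="", encoding="utf-8") as f:
    f.write(sample)

products = []
try:
    # open() decodes the bytes; csv.reader then yields lists of str.
    with open(path, newline="", encoding="utf-8") as f:
        for field1, field2, field3 in csv.reader(f):
            products.append((field1, field2, field3))
finally:
    os.remove(path)

print(len(products))
```

Passing encoding explicitly avoids depending on the platform's default encoding, which is not necessarily UTF-8.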

answered Sep 18 '22 01:09

jb.
