This seems like it should be an easy fix, but so far a solution has eluded me. I have a single-column CSV file containing non-ASCII characters, saved as UTF-8, that I want to read in and store in a list. I'm attempting to follow the principle of the "Unicode Sandwich" and decode upon reading the file in:
    import codecs
    import csv

    with codecs.open('utf8file.csv', 'rU', encoding='utf-8') as file:
        input_file = csv.reader(file, delimiter=",", quotechar='|')
        list = []
        for row in input_file:
            list.extend(row)
This produces the dreaded 'codec can't encode characters in position, ordinal not in range(128)' error.
I've also tried adapting a solution from this answer, which returns a similar error:
    def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]

    filename = 'inputs\encode.csv'
    reader = unicode_csv_reader(open(filename))
    target_list = []
    for field1 in reader:
        target_list.extend(field1)
A very similar solution adapted from the docs returns the same error.
    def unicode_csv_reader(utf8_data, dialect=csv.excel):
        csv_reader = csv.reader(utf_8_encoder(utf8_data), dialect)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')

    filename = 'inputs\encode.csv'
    reader = unicode_csv_reader(open(filename))
    target_list = []
    for field1 in reader:
        target_list.extend(field1)
Clearly I'm missing something. Most of the questions that I've seen regarding this problem seem to predate Python 2.7, so an update here might be useful.
Your first snippet won't work. You are feeding unicode data to the csv reader, which (as documented) can't handle it.
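As a rough illustration of where that message comes from (a sketch, not the csv module's actual internals): wherever Python 2 expects a byte string and gets unicode, it implicitly encodes with the default ASCII codec, and doing the same thing explicitly fails in the same way. The sample string here is made up.

```python
# A made-up line of CSV text held as a unicode string.
line = u'caf\xe9,na\xefve'

error_message = None
try:
    # Explicit version of the implicit ASCII encode that Python 2 performs
    # when unicode is handed to code that expects a byte string.
    line.encode('ascii')
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message printed should end with the familiar "ordinal not in range(128)".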
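Your 2nd and 3rd snippets are confused. The third, for instance, calls .encode('utf-8') on lines that are already UTF-8 byte strings, which makes Python 2 attempt an implicit ASCII decode first. Something like the following is all that you need: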
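    import csv

    # Open in binary mode: the Python 2 csv module expects byte strings.
    with open('your_utf8_encoded_file.csv', 'rb') as f:
        reader = csv.reader(f)
        for utf8_row in reader:
            # Decode each cell only after csv has parsed the row.
            unicode_row = [x.decode('utf8') for x in utf8_row]
            print unicode_row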
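If the same script also has to run under Python 3, where the csv module wants text rather than bytes, one possible shape is a small helper that branches on the interpreter version. This is only a sketch: `read_unicode_rows`, the temp file name and the sample data are all made up for illustration.

```python
import csv
import io
import os
import sys
import tempfile


def read_unicode_rows(path, encoding='utf-8'):
    """Hypothetical helper: return CSV rows as lists of unicode strings."""
    if sys.version_info[0] == 2:
        # Python 2: csv wants bytes, so read in binary mode and decode each cell.
        with open(path, 'rb') as f:
            return [[cell.decode(encoding) for cell in row]
                    for row in csv.reader(f)]
    # Python 3: csv wants text, so let the file object do the decoding.
    with io.open(path, 'r', encoding=encoding, newline='') as f:
        return [row for row in csv.reader(f)]


# Write a small single-column sample file (made-up data) and read it back.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_utf8.csv')
with io.open(sample_path, 'w', encoding='utf-8', newline='') as f:
    f.write(u'caf\xe9\r\nna\xefve\r\n')

target_list = []
for row in read_unicode_rows(sample_path):
    target_list.extend(row)
```

Either way the decoding happens at the edge, so everything after the read works with unicode, which is the "Unicode Sandwich" idea from the question.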
If it fails on the first character read, you may have a BOM. Use codecs.open('utf8file.csv', 'rU', encoding='utf-8-sig') if your file is UTF-8 and has a BOM at the beginning.
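To see what the 'utf-8-sig' codec changes, here is a minimal sketch on an in-memory byte string rather than a real file (the sample bytes are made up). It also shows one way to check whether the data starts with a BOM at all.

```python
import codecs

# Made-up sample: a UTF-8 BOM followed by the UTF-8 bytes for u'caf\xe9'.
raw = codecs.BOM_UTF8 + u'caf\xe9'.encode('utf-8')

# Plain 'utf-8' keeps the BOM as a leading u'\ufeff' character.
with_bom = raw.decode('utf-8')

# 'utf-8-sig' strips a leading BOM if there is one.
without_bom = raw.decode('utf-8-sig')

# Quick check for a BOM before deciding which codec to use.
has_bom = raw.startswith(codecs.BOM_UTF8)
```

'utf-8-sig' also decodes data that has no BOM, so it should be a safe choice when you are not sure.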