Trouble with UTF-8 CSV input in Python

Question

This seems like it should be an easy fix, but so far a solution has eluded me. I have a single-column CSV file containing non-ASCII characters, saved as UTF-8, that I want to read in and store in a list. I'm trying to follow the "Unicode Sandwich" principle and decode when reading the file in:

    import codecs
    import csv

    with codecs.open('utf8file.csv', 'rU', encoding='utf-8') as file:
        input_file = csv.reader(file, delimiter=",", quotechar='|')
        list = []
        for row in input_file:
            list.extend(row)

This produces the dreaded 'codec can't encode characters in position, ordinal not in range(128)' error.

I've also tried adapting a solution from this answer, which returns a similar error:

    def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]

    filename = 'inputs\encode.csv'
    reader = unicode_csv_reader(open(filename))
    target_list = []
    for field1 in reader:
        target_list.extend(field1)

A very similar solution adapted from the docs returns the same error:

    def unicode_csv_reader(utf8_data, dialect=csv.excel):
        csv_reader = csv.reader(utf_8_encoder(utf8_data), dialect)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')

    filename = 'inputs\encode.csv'
    reader = unicode_csv_reader(open(filename))
    target_list = []
    for field1 in reader:
        target_list.extend(field1)

Clearly I'm missing something. Most of the questions that I've seen regarding this problem seem to predate Python 2.7, so an update here might be useful.

John Machin · Accepted Answer

Your first snippet won't work. You are feeding unicode data to the csv reader, which (as documented) can't handle it.

Your 2nd and 3rd snippets are confused. Something like the following is all that you need:

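    import csv

    # Python 2: open in binary mode so csv.reader receives byte strings
    f = open('your_utf8_encoded_file.csv', 'rb')
    reader = csv.reader(f)
    for utf8_row in reader:
        # Decode each UTF-8 encoded cell to unicode only after parsing
        unicode_row = [x.decode('utf8') for x in utf8_row]
        print unicode_row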
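For readers on Python 3, where the csv module works on text rather than bytes, the decode-on-read approach from the question can be used directly. The following is only a sketch under that assumption: the helper name read_single_column and the in-memory sample data are illustrative and not part of the original thread, and the delimiter and quotechar are simply carried over from the question.

```python
import csv
import io


def read_single_column(text_stream):
    """Collect every cell from a CSV text stream into one flat list."""
    cells = []
    for row in csv.reader(text_stream, delimiter=",", quotechar="|"):
        cells.extend(row)
    return cells


# In-memory stand-in for a UTF-8 encoded file (hypothetical sample data).
# With a real file, open(path, encoding="utf-8", newline="") yields an
# equivalent text stream.
sample_bytes = u"caf\u00e9\nna\u00efve\n".encode("utf-8")
stream = io.TextIOWrapper(io.BytesIO(sample_bytes), encoding="utf-8", newline="")

cells = read_single_column(stream)
```

Here the decoding happens once at the file boundary, so every cell handed back by the reader is already a text string and no per-cell decode step is needed.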

Zeugma · Answer

As it fails on the first character read, you may have a BOM. Use codecs.open('utf8file.csv', 'rU', encoding='utf-8-sig') if your file is UTF-8 and has a BOM at the beginning.
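A small sketch of what the two codecs do with a leading BOM, using made-up sample bytes rather than the asker's file. The variable names are illustrative only.

```python
import codecs

# Hypothetical file contents: a UTF-8 BOM followed by UTF-8 encoded text.
raw = codecs.BOM_UTF8 + u"caf\u00e9".encode("utf-8")

# Checking the first bytes tells you whether a BOM is present.
has_bom = raw.startswith(codecs.BOM_UTF8)

# 'utf-8' keeps the BOM as a leading U+FEFF character in the decoded text.
with_bom = raw.decode("utf-8")

# 'utf-8-sig' strips the BOM while decoding.
without_bom = raw.decode("utf-8-sig")
```

With 'utf-8' the first cell of the first row would begin with the invisible U+FEFF character, which is a common source of puzzling failures on the very first value. 'utf-8-sig' also accepts input that has no BOM.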
