Trouble with UTF-8 CSV input in Python

This seems like it should be an easy fix, but so far a solution has eluded me. I have a single-column CSV file, saved as UTF-8 and containing non-ASCII characters, that I want to read in and store in a list. I'm attempting to follow the principle of the "Unicode Sandwich" and decode upon reading the file in:

    import codecs
    import csv

    with codecs.open('utf8file.csv', 'rU', encoding='utf-8') as file:
        input_file = csv.reader(file, delimiter=",", quotechar='|')
        list = []
        for row in input_file:
            list.extend(row)

This produces the dreaded 'codec can't encode characters in position, ordinal not in range(128)' error.

I've also tried adapting a solution from this answer, which raises a similar error:

    def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]

    filename = 'inputs\encode.csv'
    reader = unicode_csv_reader(open(filename))
    target_list = []
    for field1 in reader:
        target_list.extend(field1)

A very similar solution adapted from the docs raises the same error:

    def unicode_csv_reader(utf8_data, dialect=csv.excel):
        csv_reader = csv.reader(utf_8_encoder(utf8_data), dialect)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')

    filename = 'inputs\encode.csv'
    reader = unicode_csv_reader(open(filename))
    target_list = []
    for field1 in reader:
        target_list.extend(field1)

Clearly I'm missing something. Most of the questions that I've seen regarding this problem seem to predate Python 2.7, so an update here might be useful.

acpigeon asked May 22 '12 21:05


2 Answers

Your first snippet won't work. You are feeding unicode data to the csv reader, which (as documented) can't handle it.
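
The error itself can be reproduced in isolation. This is a minimal sketch, assuming the Python 2 csv reader ends up implicitly encoding unicode input with the default ASCII codec; the sample string and the variable names are made up for illustration:

```python
# Illustrative only: mimics the implicit ASCII encoding that is assumed to
# happen when unicode text is handed to a byte-oriented API.
text = u"caf\xe9"  # hypothetical cell value with one non-ASCII character

try:
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    # The message names the codec, the position, and the range limit.
    error_message = str(exc)

print(error_message)
```

The message should end in "ordinal not in range(128)", which matches the error quoted in the question.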

Your 2nd and 3rd snippets are confused. Something like the following is all that you need:

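    import csv

    # Open in binary mode so the csv module sees UTF-8 encoded bytes (Python 2).
    f = open('your_utf8_encoded_file.csv', 'rb')
    reader = csv.reader(f)
    for utf8_row in reader:
        # Decode each cell to unicode only after csv has parsed the row.
        unicode_row = [x.decode('utf8') for x in utf8_row]
        print unicode_row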
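
For readers on Python 3 (an assumption; the question targets Python 2.7), the csv module accepts text directly, so decoding at the file boundary is enough. The sketch below writes a small hypothetical file first so that it is self-contained; the sample values and the temporary path are not from the question:

```python
import csv
import os
import tempfile

# Hypothetical sample data standing in for the single-column file.
sample_rows = ["caf\u00e9", "na\u00efve", "\u65e5\u672c\u8a9e"]

path = os.path.join(tempfile.mkdtemp(), "utf8file.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows([value] for value in sample_rows)

# Python 3 only: decode while reading, then let csv parse the text.
target_list = []
with open(path, encoding="utf-8", newline="") as f:
    for row in csv.reader(f):
        target_list.extend(row)

print(len(target_list), "cells read")
```

Passing newline="" follows the csv module's recommendation for file objects, so that newlines embedded in quoted fields are handled by the reader rather than by the file layer.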

John Machin answered Oct 06 '22 04:10


As it fails on the first character read, you may have a BOM. Use codecs.open('utf8file.csv', 'rU', encoding='utf-8-sig') if your file is UTF-8 and has a BOM at the beginning.
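
The difference between the two codecs can be checked on a throwaway file. This is a small sketch, assuming a made-up temporary file that starts with a UTF-8 BOM; the path and contents are hypothetical:

```python
import codecs
import io
import os
import tempfile

# Hypothetical file: a UTF-8 BOM followed by one cell of text.
path = os.path.join(tempfile.mkdtemp(), "utf8file.csv")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"caf\u00e9\n".encode("utf-8"))

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character.
with io.open(path, encoding="utf-8") as f:
    with_bom = f.read()

# 'utf-8-sig' drops a leading BOM if one is present.
with io.open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()

print(repr(with_bom[0]), repr(without_bom[0]))
```

Because 'utf-8-sig' also decodes files that have no BOM, it is a reasonable default when the origin of the file is unknown.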

Zeugma answered Oct 06 '22 06:10
