[Update] Appreciate the answers and input all around, but working code would be most welcome. If you can supply code that can read the sample files you are king (or queen).
[Update 2] Thanks for the excellent answers and discussion. What I need to do with these is to read them in, parse them, and save parts of them in Django model instances. I believe that means converting them from their native encoding to unicode so Django can deal with them, right?
There are several questions on Stack Overflow already about reading non-ASCII CSV files in Python, but the solutions shown there and in the Python documentation don't work with the input files I'm trying to read.
The gist of the recommended solution is to encode('utf-8') the input to the CSV reader and unicode(item, 'utf-8') the output of the reader. However, this runs into UnicodeDecodeError issues (see the questions linked above):
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 8: unexpected
The input file is not necessarily in utf8; it can be ISO-8859-1, cp1251, or just about anything else.
So, the question: what's a resilient, cross-encoding capable way to read CSV files in Python?
The root of the issue seems to be that the CSV module is a C extension; is there a pure-python CSV reading module?
If not, is there a way to confidently detect the encoding of the input file so that it can be processed?
Basically I'm looking for a bulletproof way to read (and hopefully write) CSV files in any encoding.
Here are two sample files: European, Russian.
And here's the recommended solution failing:
Python 2.6.4 (r264:75821M, Oct 27 2009, 19:48:32)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
...     # csv.py doesn't do Unicode; encode temporarily as UTF-8:
...     csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
...                             dialect=dialect, **kwargs)
...     for row in csv_reader:
...         # decode UTF-8 back to Unicode, cell by cell:
...         yield [unicode(cell, 'utf-8') for cell in row]
...
>>> def utf_8_encoder(unicode_csv_data):
...     for line in unicode_csv_data:
...         yield line.encode('utf-8')
...
>>> r = unicode_csv_reader(file('sample-euro.csv').read().split('\n'))
>>> line = r.next()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in unicode_csv_reader
File "<stdin>", line 3, in utf_8_encoder
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf8 in position 14: ordinal not in range(128)
>>> r = unicode_csv_reader(file('sample-russian.csv').read().split('\n'))
>>> line = r.next()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in unicode_csv_reader
File "<stdin>", line 3, in utf_8_encoder
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 28: ordinal not in range(128)
You are attempting to apply a solution to a different problem. Note this:
def utf_8_encoder(unicode_csv_data)
You are feeding it str objects; it expects unicode objects, so the implicit str-to-unicode coercion uses the ASCII codec and blows up on the first non-ASCII byte.
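If the real encoding is known, the simplest fix is to decode the raw bytes to unicode before handing the lines to unicode_csv_reader. A minimal sketch, assuming (as argued below) that the euro sample really is cp1252; the file name and encoding here are just illustrations:
import codecs
# Hypothetical usage: codecs.open yields unicode lines, which is what
# unicode_csv_reader / utf_8_encoder actually expect.
f = codecs.open('sample-euro.csv', 'r', encoding='cp1252')
for row in unicode_csv_reader(f):
    print row
f.close()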
The problem with reading your non-ASCII CSV files is that you don't know the encoding and you don't know the delimiter. If you do know the encoding (and it's an ASCII-based encoding, e.g. cp125x, any East Asian encoding, UTF-8; not UTF-16, not UTF-32) and the delimiter, this will work:
for row in csv.reader(open("foo.csv", "rb"), delimiter=known_delimiter):
    row = [item.decode(encoding) for item in row]
Your sample-euro.csv looks like cp1252 with a comma delimiter. The Russian one looks like cp1251 with a semicolon delimiter. By the way, it seems from the contents that you will also need to determine what date format is being used, and maybe the currency as well -- the Russian example has money amounts followed by a space and the Cyrillic abbreviation for "roubles".
Note carefully: Resist all attempts to persuade you that you have files encoded in ISO-8859-1. They are encoded in cp1252.
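On that date/currency point, here is a minimal sketch (my own illustration, not part of this answer) of how the values seen in the samples might be normalised; the helper names and the assumption that the dates are day-first are mine:
from datetime import datetime
from decimal import Decimal

def parse_date(cell):
    # Try the day-first formats seen in the two samples (an assumption).
    for fmt in ('%d-%m-%y', '%d.%m.%Y %H:%M', '%d.%m.%Y'):
        try:
            return datetime.strptime(cell, fmt)
        except ValueError:
            pass
    raise ValueError('unrecognised date: %r' % cell)

def parse_rouble_amount(cell):
    # e.g. u'450,00\xa0\u0440\u0443\u0431.' -> Decimal('450.00'):
    # drop the no-break space plus the "rub." abbreviation, comma -> point.
    number = cell.split(u'\xa0')[0].replace(u',', u'.')
    return Decimal(number)

print parse_date(u'31-01-11'), parse_rouble_amount(u'450,00\xa0\u0440\u0443\u0431.')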
Update in response to comment """If I understand what you're saying I must know the encoding in order for this to work? In the general case I won't know the encoding and based on the other answer guessing the encoding is very difficult, so I'm out of luck?"""
You must know the encoding for ANY file-reading exercise to work.
Guessing the encoding correctly all the time, for any encoding, in any size of file, is not merely very difficult -- it's impossible. However, restricting the scope to CSV files saved out of Excel or OpenOffice in the user's locale's default encoding, and of a reasonable size, it's not such a big task. I'd suggest giving chardet a try; it guesses windows-1252 for your euro file and windows-1251 for your Russian file -- a fantastic achievement given their tiny size.
Update 2 in response to """working code would be most welcome"""
Working code (Python 2.x):
from chardet.universaldetector import UniversalDetector

chardet_detector = UniversalDetector()

def charset_detect(f, chunk_size=4096):
    global chardet_detector
    chardet_detector.reset()
    while 1:
        chunk = f.read(chunk_size)
        if not chunk: break
        chardet_detector.feed(chunk)
        if chardet_detector.done: break
    chardet_detector.close()
    return chardet_detector.result
# Exercise for the reader: replace the above with a class

import csv
import sys
from pprint import pprint

pathname = sys.argv[1]
delim = sys.argv[2]  # allegedly known
print "delim=%r pathname=%r" % (delim, pathname)
with open(pathname, 'rb') as f:
    cd_result = charset_detect(f)
    encoding = cd_result['encoding']
    confidence = cd_result['confidence']
    print "chardet: encoding=%s confidence=%.3f" % (encoding, confidence)
    # insert actions contingent on encoding and confidence here
    f.seek(0)
    csv_reader = csv.reader(f, delimiter=delim)
    for bytes_row in csv_reader:
        unicode_row = [x.decode(encoding) for x in bytes_row]
        pprint(unicode_row)
Output 1:
delim=',' pathname='sample-euro.csv'
chardet: encoding=windows-1252 confidence=0.500
[u'31-01-11',
u'Overf\xf8rsel utland',
u'UTLBET; ID 9710032001647082',
u'1990.00',
u'']
[u'31-01-11',
u'Overf\xf8ring',
u'OVERF\xd8RING MELLOM EGNE KONTI',
u'5750.00',
u';']
Output 2:
delim=';' pathname='sample-russian.csv'
chardet: encoding=windows-1251 confidence=0.602
[u'-',
u'04.02.2011 23:20',
u'300,00\xa0\u0440\u0443\u0431.',
u'',
u'\u041c\u0422\u0421',
u'']
[u'-',
u'04.02.2011 23:15',
u'450,00\xa0\u0440\u0443\u0431.',
u'',
u'\u041e\u043f\u043b\u0430\u0442\u0430 Interzet',
u'']
[u'-',
u'13.01.2011 02:05',
u'100,00\xa0\u0440\u0443\u0431.',
u'',
u'\u041c\u0422\u0421 kolombina',
u'']
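Regarding the "insert actions contingent on encoding and confidence here" placeholder in the script above, one possible policy (my own assumption, not part of the original code) is to trust chardet only above a confidence threshold and otherwise fall back to a default; a minimal sketch:
# Hypothetical helper: fall back to a caller-supplied default encoding
# when chardet's guess is missing or its confidence is too low.
def choose_encoding(cd_result, fallback='cp1252', min_confidence=0.5):
    if cd_result['encoding'] and cd_result['confidence'] >= min_confidence:
        return cd_result['encoding']
    return fallback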
Update 3 What is the source of these files? If they are being "saved as CSV" from Excel or OpenOffice Calc or Gnumeric, you could avoid the whole encoding drama by having them saved as "Excel 97-2003 Workbook (*.xls)" and using xlrd to read them. This would also save the hassles of having to inspect each csv file to determine the delimiter (comma vs semicolon), date format (31-01-11 vs 04.02.2011), and "decimal point" (5750.00 vs 450,00) -- all those differences presumably being created by saving as CSV. [Dis]claimer: I'm the author of xlrd.
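For what it's worth, reading such an .xls file with xlrd is short; a minimal sketch (the file name here is an assumption):
import xlrd

book = xlrd.open_workbook('sample.xls')
sheet = book.sheet_by_index(0)
for rownum in xrange(sheet.nrows):
    # Text cells come back as unicode, numbers as float; date cells need
    # xlrd.xldate_as_tuple(value, book.datemode) to convert.
    print sheet.row_values(rownum)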
I don't know if you've already tried this, but in the example section of the official Python documentation for the csv module, you'll find a pair of classes, UnicodeReader and UnicodeWriter. They have worked fine for me so far.
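The reader half of that recipe looks roughly like the following (paraphrased from memory of the 2.x docs, so treat the code there as authoritative); note that it still requires you to know and pass the file's encoding, which is the hard part in this question:
import csv, codecs

class UTF8Recoder:
    """Iterator that reads a stream in the given encoding and re-encodes it to UTF-8."""
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode('utf-8')

class UnicodeReader:
    """CSV reader that iterates over lines of f, which is in the given encoding."""
    def __init__(self, f, dialect=csv.excel, encoding='utf-8', **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        row = self.reader.next()
        return [unicode(s, 'utf-8') for s in row]
    def __iter__(self):
        return self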
Correctly detecting the encoding of a file seems to be a very hard problem. You can read the discussion in this Stack Overflow thread.