A resilient, actually working CSV implementation for non-ASCII?

[Update] Appreciate the answers and input all around, but working code would be most welcome. If you can supply code that can read the sample files you are king (or queen).

[Update 2] Thanks for the excellent answers and discussion. What I need to do with these is to read them in, parse them, and save parts of them in Django model instances. I believe that means converting them from their native encoding to unicode so Django can deal with them, right?
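
For concreteness, here is roughly the last step I have in mind -- the model and field names below are made up for illustration, not my real code:

from myapp.models import Transaction  # hypothetical app and model

def save_rows(unicode_rows):
    # each row is assumed to be a list of unicode strings by this point
    for row in unicode_rows:
        Transaction.objects.create(
            date=row[0],          # made-up column-to-field mapping
            description=row[1],
            amount=row[3],
        )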

There are several questions on Stack Overflow already about reading non-ASCII CSV files in Python, but the solutions shown there and in the Python documentation don't work with the input files I'm trying.

The gist of the solution seems to be to encode('utf-8') the input to the CSV reader and unicode(item, 'utf-8') the output of the reader. However, this runs into UnicodeDecodeError issues (see the questions linked above):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa3 in position 8: unexpected

The input file is not necessarily in UTF-8; it can be ISO-8859-1, cp1251, or just about anything else.

So, the question: what's a resilient, cross-encoding capable way to read CSV files in Python?

The root of the issue seems to be that the CSV module is a C extension; is there a pure-python CSV reading module?

If not, is there a way to confidently detect the encoding of the input file so that it can be processed?
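
Something along these lines is what I'm hoping exists -- a sketch assuming the third-party chardet package (I haven't verified it against these files):

import chardet  # third-party package, not in the stdlib

raw = open('sample-euro.csv', 'rb').read()
guess = chardet.detect(raw)
# guess is a dict like {'encoding': 'windows-1252', 'confidence': 0.5}
text = raw.decode(guess['encoding'])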

Basically I'm looking for a bullet proof way to read (and hopefully write) CSV files in any encoding.

Here are two sample files: European, Russian.

And here's the recommended solution failing:

Python 2.6.4 (r264:75821M, Oct 27 2009, 19:48:32)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
...     # csv.py doesn't do Unicode; encode temporarily as UTF-8:
...     csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
...                             dialect=dialect, **kwargs)
...     for row in csv_reader:
...         # decode UTF-8 back to Unicode, cell by cell:
...         yield [unicode(cell, 'utf-8') for cell in row]
...
>>> def utf_8_encoder(unicode_csv_data):
...     for line in unicode_csv_data:
...         yield line.encode('utf-8')
...
>>> r = unicode_csv_reader(file('sample-euro.csv').read().split('\n'))
>>> line = r.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in unicode_csv_reader
  File "<stdin>", line 3, in utf_8_encoder
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf8 in position 14: ordinal not in range(128)
>>> r = unicode_csv_reader(file('sample-russian.csv').read().split('\n'))
>>> line = r.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in unicode_csv_reader
  File "<stdin>", line 3, in utf_8_encoder
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 28: ordinal not in range(128)
asked Feb 16 '11 by Parand


2 Answers

You are attempting to apply a solution to a different problem. Note the parameter name here:

def utf_8_encoder(unicode_csv_data)

That function expects unicode objects, but you are feeding it str objects, so line.encode('utf-8') first triggers an implicit decode with the default ascii codec -- hence the UnicodeDecodeError: 'ascii' codec errors in your tracebacks.

The problem with reading your non-ASCII CSV files is that you don't know the encoding and you don't know the delimiter. If you do know the encoding (and it's an ASCII-based encoding: cp125x, any East Asian encoding, UTF-8 -- not UTF-16 or UTF-32) and you know the delimiter, this will work:

for row in csv.reader(open("foo.csv", "rb"), delimiter=known_delimiter):
    row = [item.decode(encoding) for item in row]

Your sample-euro.csv looks like cp1252 with a comma delimiter. The Russian one looks like cp1251 with a semicolon delimiter. By the way, it seems from the contents that you will also need to determine what date format is being used, and maybe the currency as well -- the Russian example has money amounts followed by a space and the Cyrillic abbreviation for "roubles".

Note carefully: Resist all attempts to persuade you that you have files encoded in ISO-8859-1. They are encoded in cp1252.

Update in response to comment """If I understand what you're saying I must know the encoding in order for this to work? In the general case I won't know the encoding and based on the other answer guessing the encoding is very difficult, so I'm out of luck?"""

You must know the encoding for ANY file-reading exercise to work.

Guessing the encoding correctly all the time, for any encoding, in any size of file, is not merely very difficult -- it's impossible. However, restricting the scope to CSV files saved out of Excel or OpenOffice in the user's locale's default encoding, and of a reasonable size, it's not such a big task. I'd suggest giving chardet a try; it guesses windows-1252 for your euro file and windows-1251 for your Russian file -- a fantastic achievement given their tiny size.

Update 2 in response to """working code would be most welcome"""

Working code (Python 2.x):

from chardet.universaldetector import UniversalDetector
chardet_detector = UniversalDetector()

def charset_detect(f, chunk_size=4096):
    global chardet_detector
    chardet_detector.reset()
    while 1:
        chunk = f.read(chunk_size)
        if not chunk: break
        chardet_detector.feed(chunk)
        if chardet_detector.done: break
    chardet_detector.close()
    return chardet_detector.result

# Exercise for the reader: replace the above with a class

import csv    
import sys
from pprint import pprint

pathname = sys.argv[1]
delim = sys.argv[2] # allegedly known
print "delim=%r pathname=%r" % (delim, pathname)

with open(pathname, 'rb') as f:
    cd_result = charset_detect(f)
    encoding = cd_result['encoding']
    confidence = cd_result['confidence']
    print "chardet: encoding=%s confidence=%.3f" % (encoding, confidence)
    # insert actions contingent on encoding and confidence here
    f.seek(0)
    csv_reader = csv.reader(f, delimiter=delim)
    for bytes_row in csv_reader:
        unicode_row = [x.decode(encoding) for x in bytes_row]
        pprint(unicode_row)

Output 1:

delim=',' pathname='sample-euro.csv'
chardet: encoding=windows-1252 confidence=0.500
[u'31-01-11',
 u'Overf\xf8rsel utland',
 u'UTLBET; ID 9710032001647082',
 u'1990.00',
 u'']
[u'31-01-11',
 u'Overf\xf8ring',
 u'OVERF\xd8RING MELLOM EGNE KONTI',
 u'5750.00',
 u';']

Output 2:

delim=';' pathname='sample-russian.csv'
chardet: encoding=windows-1251 confidence=0.602
[u'-',
 u'04.02.2011 23:20',
 u'300,00\xa0\u0440\u0443\u0431.',
 u'',
 u'\u041c\u0422\u0421',
 u'']
[u'-',
 u'04.02.2011 23:15',
 u'450,00\xa0\u0440\u0443\u0431.',
 u'',
 u'\u041e\u043f\u043b\u0430\u0442\u0430 Interzet',
 u'']
[u'-',
 u'13.01.2011 02:05',
 u'100,00\xa0\u0440\u0443\u0431.',
 u'',
 u'\u041c\u0422\u0421 kolombina',
 u'']

Update 3 What is the source of these files? If they are being "saved as CSV" from Excel or OpenOffice Calc or Gnumeric, you could avoid the whole encoding drama by having them saved as "Excel 97-2003 Workbook (*.xls)" and use xlrd to read them. This would also save the hassles of having to inspect each csv file to determine the delimiter (comma vs semicolon), date format (31-01-11 vs 04.02.2011), and "decimal point" (5750.00 vs 450,00) -- all those differences presumably being created by saving as CSV. [Dis]claimer: I'm the author of xlrd.
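
A minimal sketch of that route (the file name is a placeholder, and xlrd is assumed to be installed):

import xlrd

book = xlrd.open_workbook('sample.xls')   # placeholder name
sheet = book.sheet_by_index(0)
for rownum in range(sheet.nrows):
    # text cells arrive as unicode objects already; dates are float serials,
    # convertible with xlrd.xldate_as_tuple(value, book.datemode)
    print sheet.row_values(rownum)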

answered by John Machin


I don't know if you've already tried this, but in the Examples section of the official Python documentation for the csv module, you'll find a pair of classes: UnicodeReader and UnicodeWriter. They have worked fine for me so far.
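
For reference, the reader half goes roughly like this (paraphrased from the Python 2.x csv documentation; it shuttles everything through UTF-8 internally):

import codecs
import csv

class UTF8Recoder:
    """Iterator that reads an encoded stream and re-encodes the input to UTF-8."""
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode('utf-8')

class UnicodeReader:
    """A CSV reader that iterates over lines of f, decoded from the given encoding."""
    def __init__(self, f, dialect=csv.excel, encoding='utf-8', **kwds):
        self.reader = csv.reader(UTF8Recoder(f, encoding), dialect=dialect, **kwds)
    def next(self):
        return [unicode(s, 'utf-8') for s in self.reader.next()]
    def __iter__(self):
        return self

Note that UnicodeReader still has to be told the encoding, so it solves the csv-module-versus-unicode problem, not the detection problem.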

Correctly detecting the encoding of a file seems to be a very hard problem. You can read the discussion in this Stack Overflow thread.

answered by rubayeet