General Unicode/UTF-8 support for csv files in Python 2.6

The csv module in Python doesn't work properly when UTF-8/Unicode is involved. The Python documentation and other webpages offer snippets that work for specific cases, but you have to understand exactly which encoding you are handling and pick the appropriate snippet.

How can I read and write both strings and Unicode strings in .csv files in a way that "just works" in Python 2.6? Or is this a limitation of Python 2.6 that has no simple solution?

asked Dec 04 '09 10:12 by djen


1 Answer

The example code for reading Unicode given at http://docs.python.org/library/csv.html#examples appears to be obsolete: it doesn't work with Python 2.6 or 2.7.

Below is a UnicodeDictReader that works with UTF-8 and may work with other encodings, though I only tested it on UTF-8 input.

The idea in short is to decode Unicode only after a csv row has been split into fields by csv.reader.

    import csv


    class UnicodeCsvReader(object):
        def __init__(self, f, encoding="utf-8", **kwargs):
            self.csv_reader = csv.reader(f, **kwargs)
            self.encoding = encoding

        def __iter__(self):
            return self

        def next(self):
            # read and split the csv row into fields
            row = self.csv_reader.next()
            # now decode
            return [unicode(cell, self.encoding) for cell in row]

        @property
        def line_num(self):
            return self.csv_reader.line_num


    class UnicodeDictReader(csv.DictReader):
        def __init__(self, f, encoding="utf-8", fieldnames=None, **kwds):
            csv.DictReader.__init__(self, f, fieldnames=fieldnames, **kwds)
            # swap the plain reader for the decoding one
            self.reader = UnicodeCsvReader(f, encoding=encoding, **kwds)

Usage (source file encoding is utf-8):

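    # -*- coding: utf-8 -*-
    # The coding line and the __future__ import belong at the very top of test.py.
    # Without print_function, Python 2 would print each (type, value) pair as a tuple.
    from __future__ import print_function

    csv_lines = (
        "абв,123",
        "где,456",
    )

    for row in UnicodeCsvReader(csv_lines):
        for col in row:
            print(type(col), col)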

Output:

    $ python test.py
    <type 'unicode'> абв
    <type 'unicode'> 123
    <type 'unicode'> где
    <type 'unicode'> 456
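
The reader covers only the reading half of the question. For writing, the same idea can probably be mirrored: encode each Unicode cell to bytes first, then let csv.writer do the quoting and joining. The following is a minimal sketch under that assumption, not part of the original answer; `encode_row` and `UnicodeCsvWriter` are hypothetical names, and the class is meant for Python 2 with a byte-oriented file object and an ASCII-compatible encoding such as UTF-8.

```python
import csv


def encode_row(row, encoding="utf-8"):
    """Encode text cells to bytes; leave every other cell untouched."""
    text_type = type(u"")  # this is `unicode` on Python 2
    return [
        cell.encode(encoding) if isinstance(cell, text_type) else cell
        for cell in row
    ]


class UnicodeCsvWriter(object):
    """Hypothetical mirror of UnicodeCsvReader (intended for Python 2 only)."""

    def __init__(self, f, encoding="utf-8", **kwargs):
        # f is assumed to be opened in binary mode, e.g. open(path, "wb")
        self.csv_writer = csv.writer(f, **kwargs)
        self.encoding = encoding

    def writerow(self, row):
        # encode first, then hand plain byte strings to csv.writer
        self.csv_writer.writerow(encode_row(row, self.encoding))

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
```

Usage would look like `UnicodeCsvWriter(open("out.csv", "wb")).writerow([u"abc", 123])`. Non-text cells such as integers are passed through unchanged, so csv.writer converts them as it normally does.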
answered Oct 03 '22 16:10 by Maxim Egorushkin