Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python UTF-16 CSV reader

Tags:

python

csv

utf-16

I have a UTF-16 CSV file which I have to read. Python csv module does not seem to support UTF-16.

I am using python 2.7.2. CSV files I need to parse are huge size running into several GBs of data.

Answers for John Machin questions below

print repr(open('test.csv', 'rb').read(100))

Output with test.csv having just abc as content

'\xff\xfea\x00b\x00c\x00'

I think csv file got created on windows machine in USA. I am using Mac OSX Lion.

If I use code provided by phihag and test.csv containing one record.

sample test.csv content used. Below is print repr(open('test.csv', 'rb').read(1000)) output

'\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'

Code by phihag

import codecs
import csv
with open('test.csv','rb') as f:
      sr = codecs.StreamRecoder(f,codecs.getencoder('utf-8'),codecs.getdecoder('utf-8'),codecs.getreader('utf-16'),codecs.getwriter('utf-16'))      
      for row in csv.reader(sr):
         print row

Output of the above code

['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85']
['', '', 'I']

expected output is

['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85','','I']
like image 764
venky Avatar asked Feb 07 '12 14:02

venky


People also ask

How do I convert a CSV file to UTF 16?

Open LibreOffice and click Open File along the left. Select your file and Open. Click File > Save As... The following window will appear, change the File Type to Text CSV and select the Edit filter settings option, then click Save.

How do I read and print a csv file in Python?

csv file in reading mode using open() function. Then, the csv. reader() is used to read the file, which returns an iterable reader object. The reader object is then iterated using a for loop to print the contents of each row.


1 Answers

At the moment, the csv module does not support UTF-16.

In Python 3.x, csv expects a text-mode file and you can simply use the encoding parameter of open to force another encoding:

# Python 3.x only
import csv
with open('utf16.csv', 'r', encoding='utf16') as csvf:
    for line in csv.reader(csvf):
        print(line) # do something with the line

In Python 2.x, you can recode the input:

# Python 2.x only
import codecs
import csv

class Recoder(object):
    def __init__(self, stream, decoder, encoder, eol='\r\n'):
        self._stream = stream
        self._decoder = decoder if isinstance(decoder, codecs.IncrementalDecoder) else codecs.getincrementaldecoder(decoder)()
        self._encoder = encoder if isinstance(encoder, codecs.IncrementalEncoder) else codecs.getincrementalencoder(encoder)()
        self._buf = ''
        self._eol = eol
        self._reachedEof = False

    def read(self, size=None):
        r = self._stream.read(size)
        raw = self._decoder.decode(r, size is None)
        return self._encoder.encode(raw)

    def __iter__(self):
        return self

    def __next__(self):
        if self._reachedEof:
            raise StopIteration()
        while True:
            line,eol,rest = self._buf.partition(self._eol)
            if eol == self._eol:
                self._buf = rest
                return self._encoder.encode(line + eol)
            raw = self._stream.read(1024)
            if raw == '':
                self._decoder.decode(b'', True)
                self._reachedEof = True
                return self._encoder.encode(self._buf)
            self._buf += self._decoder.decode(raw)
    next = __next__

    def close(self):
        return self._stream.close()

with open('test.csv','rb') as f:
    sr = Recoder(f, 'utf-16', 'utf-8')

    for row in csv.reader(sr):
        print (row)

open and codecs.open require the file to start with a BOM. If it doesn't (or you're on Python 2.x), you can still convert it in memory, like this:

try:
    from io import BytesIO
except ImportError: # Python < 2.6
    from StringIO import StringIO as BytesIO
import csv
with open('utf16.csv', 'rb') as binf:
    c = binf.read().decode('utf-16').encode('utf-8')
for line in csv.reader(BytesIO(c)):
    print(line) # do something with the line
like image 164
phihag Avatar answered Oct 16 '22 22:10

phihag