Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python read huge file line by line with utf-8 encoding

I want to read some quite huge files(to be precise: the google ngram 1 word dataset) and count how many times a character occurs. Now I wrote this script:

import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0,9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files):
    line = line.strip()
    data = line.split('\t')
    for character in list(data[0]):
        if (not character in charcounts):
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if (fileinput.filename() is not lastfile):
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if(fileinput.filelineno() % 100000 == 0):
        print(fileinput.filelineno())
print(charcounts)

which works fine, until it reaches approx. line 700.000 of the first file, I then get this error:

../../datasets/googlebooks-eng-all-1gram-20090715-0.csv
100000
200000
300000
400000
500000
600000
700000
Traceback (most recent call last):
  File "charactercounter.py", line 5, in <module>
    for line in fileinput.input(files):
  File "C:\Python31\lib\fileinput.py", line 254, in __next__
    line = self.readline()
  File "C:\Python31\lib\fileinput.py", line 349, in readline
    self._buffer = self._file.readlines(self._bufsize)
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7771: cha
racter maps to <undefined>

To solve this I searched the web a bit, and came up with this code:

import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0,9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files,False,'',0,'r',fileinput.hook_encoded('utf-8')):
    line = line.strip()
    data = line.split('\t')
    for character in list(data[0]):
        if (not character in charcounts):
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if (fileinput.filename() is not lastfile):
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if(fileinput.filelineno() % 100000 == 0):
        print(fileinput.filelineno())
print(charcounts)

but the hook I now use tries to read the entire, 990MB, file into the memory at once, which kind of crashes my pc. Does anyone know how to rewrite this code so that it actually works?

p.s: the code hasn't even run all the way yet, so I don't even know if it does what it has to do, but for that to happen I first need to fix this bug.

Oh, and I use Python 3.2

like image 898
teuneboon Avatar asked Mar 30 '11 22:03

teuneboon


3 Answers

I do not know why fileinput does not work as expected.

I suggest you use the open function instead. The return value can be iterated over and will return lines, just like fileinput.

The code will then be something like:

for filename in files:
    print(filename)
    for filelineno, line in enumerate(open(filename, encoding="utf-8")):
        line = line.strip()
        data = line.split('\t')
        # ...

Some documentation links: enumerate, open, io.TextIOWrapper (open returns an instance of TextIOWrapper).

like image 82
codeape Avatar answered Nov 02 '22 11:11

codeape


The problem is that fileinput doesn't use file.xreadlines(), which reads line by line, but file.readline(bufsize), which reads bufsize bytes at once (and turns that into a list of lines). You are providing 0 for the bufsize parameter of fileinput.input() (which is also the default value). Bufsize 0 means that the whole file is buffered.

Solution: provide a reasonable bufsize.

like image 40
Steven Avatar answered Nov 02 '22 11:11

Steven


This works for me: you can use "utf-8" in the hook definition. I used it on a 50GB/200M lines file with no problem.

fi = fileinput.FileInput(openhook=fileinput.hook_encoded("iso-8859-1"))
like image 28
oba Avatar answered Nov 02 '22 09:11

oba