
Parsing large (20GB) text file with python - reading in 2 lines as 1

I'm parsing a 20 GB file and outputting lines that meet a certain condition to another file. However, occasionally Python will read in two lines at once and concatenate them.

inputFileHandle = open(inputFileName, 'r')

row = 0

for line in inputFileHandle:
    row = row + 1
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)

I've checked the line endings in the source file and they check out as line feeds (ASCII char 10). Pulling out the problem rows and parsing them in isolation works as expected. Am I hitting some Python limitation here? The position in the file of the first anomaly is around the 4GB mark.

James asked Apr 19 '12




2 Answers

A quick Google search for "python reading files larger than 4gb" yielded many results. See here for one such example, and another one that takes over from the first.

It's a bug in Python.

Now, the explanation of the bug: it's not easy to reproduce because it depends both on the internal FILE buffer size and on the number of chars passed to fread().

In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment: "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF." Oddly, there is an almost exact copy of this function in the Perl source code: http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668

The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it fails because it is unable to return the current position in a 32-bit DWORD. [The fix is easy; do you see it?] At that point, the function thinks the next read() will return the LF, but it won't, because the file pointer was not moved back.
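To convince yourself that the raw bytes are intact and the damage happens only in this text-mode layer, you could re-read a window around the 4 GB boundary with the file opened in binary mode, which bypasses the CRT's CR/LF translation entirely. This is just a diagnostic sketch, not part of the bug report; inputFileName comes from the question, and the window size and offset are assumptions:

CHUNK = 1 << 20                       # 1 MiB window
offset = 4 * 1024 ** 3 - CHUNK // 2   # straddle the 4 GiB boundary

with open(inputFileName, 'rb') as rawFile:   # 'rb': no newline translation
    rawFile.seek(offset)
    data = rawFile.read(CHUNK)

print(data.count('\n'))   # LF count in the window
print(data.count('\r'))   # any stray CRs would show up here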

And the work-around:

Note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python itself); on 2.7, you may use io.open().
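Here is a minimal sketch of that io.open() route on Python 2.7. inputFileName, outputFileName, and line_meets_condition are placeholders carried over from the question, and the latin-1 encoding is only an assumption standing in for whatever the file actually uses:

import io

lstIgnoredRows = []

# io.open() reads the underlying file in binary and does newline handling in
# Python itself, so the CRT's 32-bit text-mode seek path is never involved.
with io.open(inputFileName, 'r', encoding='latin-1') as inputFileHandle, \
     io.open(outputFileName, 'w', encoding='latin-1') as outputFileHandle:
    for row, line in enumerate(inputFileHandle, 1):   # 1-based rows, as in the question
        if line_meets_condition(line):                # stand-in for the real test
            outputFileHandle.write(line)
        else:
            lstIgnoredRows.append(row)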

Josh Smeaton answered Sep 20 '22


The 4GB mark is suspiciously near the maximum value that can be stored in a 32-bit register (2**32).
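As a quick sanity check of that arithmetic (a throwaway snippet, not part of the answer), 2**32 bytes is exactly 4 GiB, which lines up with the anomaly appearing "around the 4GB mark":

print(2 ** 32)                      # 4294967296 bytes
print(2 ** 32 / float(1024 ** 3))   # 4.0, i.e. exactly 4 GiB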

The code you've posted looks fine by itself, so I would suspect a bug in your Python build.

FWIW, the snippet would be a little cleaner if it used enumerate:

inputFileHandle = open(inputFileName, 'r')

# start=1 keeps the same 1-based row numbering as the original snippet
for row, line in enumerate(inputFileHandle, 1):
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)
Raymond Hettinger answered Sep 21 '22