
Parsing large (20GB) text file with python - reading in 2 lines as 1

I'm parsing a 20 GB file and outputting lines that meet a certain condition to another file. However, occasionally Python will read in two lines at once and concatenate them.

inputFileHandle = open(inputFileName, 'r')

row = 0

for line in inputFileHandle:
    row = row + 1
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)

I've checked the line endings in the source file and they check out as line feeds (ASCII char 10). Pulling out the problem rows and parsing them in isolation works as expected. Am I hitting some Python limitation here? The position in the file of the first anomaly is around the 4GB mark.

James asked Apr 19 '12




2 Answers

A quick Google search for "python reading files larger than 4gb" yielded many results. See here for one such example, and another one that takes over from the first.

It's a bug in Python.

Now, the explanation of the bug: it's not easy to reproduce because it depends both on the internal FILE buffer size and on the number of chars passed to fread().

In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment: "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF." Oddly, there is an almost exact copy of this function in the Perl source code: http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668

The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it fails because it is unable to return the current position in a 32-bit DWORD. [The fix is easy; do you see it?] At that point, the function thinks the next read() will return the LF, but it won't, because the file pointer was not moved back.
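To convince yourself that the raw bytes are intact and the damage happens only in this text-mode layer, you could re-read a window around the 4 GB boundary with the file opened in binary mode, which bypasses the CRT's CR/LF translation entirely. This is just a diagnostic sketch, not part of the bug report; inputFileName comes from the question, and the window size and offset are assumptions:

CHUNK = 1 << 20                       # 1 MiB window
offset = 4 * 1024 ** 3 - CHUNK // 2   # straddle the 4 GiB boundary

with open(inputFileName, 'rb') as rawFile:   # 'rb': no newline translation
    rawFile.seek(offset)
    data = rawFile.read(CHUNK)

print(data.count('\n'))   # LF count in the window
print(data.count('\r'))   # any stray CRs would show up here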

And the work-around:

Note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python itself); on 2.7, you may use io.open().
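Here is a minimal sketch of that io.open() route on Python 2.7. inputFileName, outputFileName, and line_meets_condition are placeholders carried over from the question, and the latin-1 encoding is only an assumption standing in for whatever the file actually uses:

import io

lstIgnoredRows = []

# io.open() reads the underlying file in binary and does newline handling in
# Python itself, so the CRT's 32-bit text-mode seek path is never involved.
with io.open(inputFileName, 'r', encoding='latin-1') as inputFileHandle, \
     io.open(outputFileName, 'w', encoding='latin-1') as outputFileHandle:
    for row, line in enumerate(inputFileHandle, 1):   # 1-based rows, as in the question
        if line_meets_condition(line):                # stand-in for the real test
            outputFileHandle.write(line)
        else:
            lstIgnoredRows.append(row)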

Josh Smeaton answered Sep 20 '22


The 4GB mark is suspiciously near the maximum value that can be stored in a 32-bit register (2**32).
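As a quick sanity check of that arithmetic (a throwaway snippet, not part of the answer), 2**32 bytes is exactly 4 GiB, which lines up with the anomaly appearing "around the 4GB mark":

print(2 ** 32)                      # 4294967296 bytes
print(2 ** 32 / float(1024 ** 3))   # 4.0, i.e. exactly 4 GiB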

The code you've posted looks fine by itself, so I would suspect a bug in your Python build.

FWIW, the snippet would be a little cleaner if it used enumerate:

inputFileHandle = open(inputFileName, 'r')

# start=1 keeps the same 1-based row numbering as the original snippet
for row, line in enumerate(inputFileHandle, 1):
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)
Raymond Hettinger answered Sep 21 '22