 

In Python, can one iterate through large text files using buffers and get the correct file position at the same time?

Tags:

python

file-io

I'm trying to search for some keywords in a large text file (~232GB). I want to take advantage of buffering for speed, and I also want to record the beginning positions of the lines containing those keywords.

I've seen many posts here discussing similar questions. However, the solutions that use buffering (treating the file object as an iterator) cannot give correct file positions, and the solutions that do give correct file positions usually just use f.readline(), which does not use buffering.

The only answer I saw that can do both is here:

# Read in the file once and build a list of line offsets
line_offset = []
offset = 0
for line in file:
    line_offset.append(offset)
    offset += len(line)
file.seek(0)

# Now, to skip to line n (with the first line being line 0), just do
file.seek(line_offset[n])

However, I'm not sure whether the offset += len(line) operation adds unnecessary overhead. Is there a more direct way to do this?

UPDATE:

I've done some timing, and it seems that .readline() is much slower than using the file object as an iterator, on Python 2.7.3. I used the following code:

#!/usr/bin/python

from timeit import timeit

MAX_LINES = 10000000

# use file object as iterator
def read_iter():
    with open('tweets.txt', 'r') as f:
        lino = 0
        for line in f:
            lino += 1
            if lino == MAX_LINES:
                break

# use .readline()
def read_readline():
    with open('tweets.txt', 'r') as f:
        lino = 0
        for line in iter(f.readline, ''):
            lino += 1
            if lino == MAX_LINES:
                break

# use offset += len(line) to simulate f.tell() under binary mode
def read_iter_tell():
    offset = 0
    with open('tweets.txt', 'rb') as f:
        lino = 0
        for line in f:
            lino += 1
            offset += len(line)
            if lino == MAX_LINES:
                break

# use f.tell() with .readline()
def read_readline_tell():
    with open('tweets.txt', 'rb') as f:
        lino = 0
        for line in iter(f.readline, ''):
            lino += 1
            offset = f.tell()
            if lino == MAX_LINES:
                break

print("iter: %f" % timeit("read_iter()", number=1, setup="from __main__ import read_iter"))
print("readline: %f" % timeit("read_readline()", number=1, setup="from __main__ import read_readline"))
print("iter_tell: %f" % timeit("read_iter_tell()", number=1, setup="from __main__ import read_iter_tell"))
print("readline_tell: %f" % timeit("read_readline_tell()", number=1, setup="from __main__ import read_readline_tell"))

The results look like this:

iter: 5.079951
readline: 37.333189
iter_tell: 5.775822
readline_tell: 38.629598
Asked Oct 20 '13 by Roun


1 Answer

What's wrong with using .readline()?

The sample you found is incorrect for files opened in text mode. It should work OK on Linux systems, but not on Windows. On Windows, the only way to return to a former position in a text-mode file is to seek to one of:

  1. 0 (start of file).

  2. End of file.

  3. A position formerly returned by f.tell().

You cannot compute text-mode file positions in any portable way.

So use .readline(), and/or .read(), and .tell(). Problem solved ;-)
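For instance, here is a minimal sketch of that approach applied to the keyword search from the question. The keywords are placeholders, and the file is opened in binary mode so .tell() returns plain byte offsets:

# Record the byte offset of every line containing a keyword, using only
# .readline() and .tell(). Keywords are hypothetical; bytes literals are
# used so the same code works against a file opened in binary mode.
keywords = (b'foo', b'bar')        # placeholder search terms
hits = []                          # (byte offset, line) pairs for matching lines

with open('tweets.txt', 'rb') as f:
    while True:
        offset = f.tell()          # position of the line about to be read
        line = f.readline()
        if not line:               # b'' means end of file
            break
        if any(k in line for k in keywords):
            hits.append((offset, line))

Each recorded offset can later be passed straight to f.seek() to jump back to that line.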

About buffering: whether buffering is used has nothing to do with how a file is accessed; it has entirely to do with how the file is opened. Buffering is an implementation detail. In particular, f.readline() certainly is buffered under the covers (unless you explicitly disabled buffering in your file open() call), but in a way that isn't visible to you. The problems found with using a file object as an iterator have to do with an additional layer of buffering added by the file iterator implementation (which the file.next() docs call "a hidden read-ahead buffer").
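As a small illustration of that read-ahead buffer (assuming Python 2.7, as in the question; the numbers shown are only indicative):

with open('tweets.txt', 'rb') as f:
    first_line = next(f)      # pull one line via the iterator protocol
    print(len(first_line))    # e.g. 80   - bytes your program actually consumed
    print(f.tell())           # e.g. 8192 - bytes swallowed by the hidden read-ahead buffer

The second number reflects however much the iterator chose to read ahead, so it says nothing about where the next line starts.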

To answer your other question, the expense of:

offset += len(line)

is trivial - but, as noted before, that "solution" has real problems.

Short course: don't get prematurely tricky. Do the simplest thing that works (like .readline() + .tell()), and start worrying only if that proves to be inadequate.

More details

There are actually several layers of buffering going on. Down in the hardware, your disk drive has memory buffers inside it. Above that, your operating system maintains memory buffers too, and typically tries to be "smart" when you access a file in a uniform pattern, asking the disk drive to "read ahead" disk blocks in the direction you're reading, beyond the blocks you've already asked for.

CPython's I/O builds on top of the platform C's I/O libraries. The C libraries have their own memory buffers. For Python's f.tell() to "work right", CPython has to use the C libraries in ways C dictates.

Now there's nothing special in any of this about "a line" (well, not on any of the major operating systems). "A line" is a software concept, typically meaning just "up to and including the next \n byte (Linux), \r byte (some Mac flavors), or \r\n byte pair (Windows)". The hardware, OS, and C buffers typically don't know anything about "lines" - they just work with a stream of bytes.

Under the covers, Python's .readline() essentially "reads" one byte at a time until it sees the platform's end-of-line byte sequence (\n, \r, or \r\n). I put "reads" in quotes because there's typically no disk access involved - it's typically just software at the various levels copying bytes from their memory buffers. When a disk access is involved, it's many thousands of times slower.

By doing this "one byte at a time", the C level libraries maintain correct results for f.tell(). But at a cost: there may be layers of function calls for each byte obtained.

Python's file iterator "reads" chunks of bytes at a time, into its own memory buffer. "How many" doesn't matter ;-) What matters is that it asks the C library to copy over multiple bytes at a time, and then CPython searches through its own memory buffer for end-of-line sequences. This slashes the number of function calls required. But at a different kind of cost: the C library's idea of where we are in the file reflects the number of bytes read into the file iterator's memory buffer, which has nothing in particular to do with the number of bytes the user's Python program has retrieved from that buffer.

So, yes indeed, for line in file: is typically the fastest way to get through a whole text file line by line.

Does it matter? The only way to know for sure is to time it on real data. With a 200+ GB file to read, you're going to be spending thousands of times more time doing physical disk reads than the various layers of software take to search for end-of-line byte sequences.

If it turns out it does matter, and your data and OS are such that you can open the file in binary mode and still get correct results, then the snippet of code you found will give the best of both worlds (fastest line iteration, and correct byte positions for later .seek()'ing).
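Under those assumptions, a sketch of that "best of both worlds" variant might look like this (the keywords and file name are again placeholders):

# Iterate in binary mode for speed, and track byte offsets by hand so the
# matching lines can be seek()'d to later.
keywords = (b'foo', b'bar')        # placeholder search terms
match_offsets = []                 # byte offsets of lines containing a keyword

with open('tweets.txt', 'rb') as f:
    offset = 0
    for line in f:                 # fast buffered iteration
        if any(k in line for k in keywords):
            match_offsets.append(offset)
        offset += len(line)        # exact byte length, since the file is binary

# Later, jump straight back to any matching line:
with open('tweets.txt', 'rb') as f:
    for pos in match_offsets:
        f.seek(pos)
        print(f.readline())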

Answered Sep 25 '22 by Tim Peters