Seek on a large text file in Python

Tags: python, seek, fseek

I have a few text files whose sizes range between 5 and 50 GB, and I am using Python to read them. I have specific anchors, in the form of byte offsets, to which I can seek before reading the corresponding data from each of these files (using Python's file API).

The issue I am seeing is that this reading approach works well for the relatively smaller files (< 5 GB). However, for the much larger files (> 20 GB), and especially when file.seek has to make longer jumps (hundreds of millions of bytes at a time), it sometimes takes a few hundred milliseconds to complete.

My impression was that seeks within a file are constant-time operations. But apparently they are not. Is there a way around this?

Here is what I am doing:

import time

f = open(filename, 'r+b')                 # filename is one of the large files
f.seek(209)                               # move to a known anchor
current = f.tell()
t1 = time.time()
next_pos = f.seek(current + 1200000000)   # long jump of ~1.2 GB
t2 = time.time()
line = f.readline()
delta = t2 - t1                           # time spent in the long seek

The delta variable varies between a few microseconds and a few hundred milliseconds, intermittently. I also profiled the CPU usage and didn't see anything busy there either.

asked Nov 07 '22 by khan


1 Answer

Your code runs consistently in under 10 microseconds on my system (Windows 10, Python 3.7), so there is no obvious error in your code.

NB: You should use time.perf_counter() instead of time.time() for measuring performance. The granularity of time.time() can be very bad ("not all systems provide time with a better precision than 1 second"). When comparing timings with other systems you may get strange results.
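
For illustration, the measurement from the question could be repeated with time.perf_counter() like this (a minimal sketch; filename and the offsets are taken from the question):

import time

f = open(filename, 'r+b')         # filename as in the question
f.seek(209)
current = f.tell()
t1 = time.perf_counter()          # monotonic, high-resolution clock
f.seek(current + 1200000000)
t2 = time.perf_counter()
line = f.readline()
f.close()
delta = t2 - t1
print(f'seek took {delta * 1000:.3f} ms')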

My best guess is that the seek triggers some buffering (read-ahead) action, which might be slow, depending on your system.

According to the documentation:

Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
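
If you want to see what your system picked, you can inspect the buffering layer directly (a small sketch, assuming Python 3; filename as in the question):

import io

print(io.DEFAULT_BUFFER_SIZE)     # fallback chunk size, often 8192 bytes

f = open(filename, 'rb')          # filename as in the question
print(type(f))                    # <class '_io.BufferedReader'>
print(f.raw)                      # the unbuffered io.FileIO underneath
f.close()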

You could try to disable buffering by adding the argument buffering=0 to open() and check if that makes a difference:

open(filename, 'r+b', buffering=0)
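
For example, a quick side-by-side comparison might look like this (a rough sketch, not a rigorous benchmark; filename and the offsets are placeholders from the question):

import time

# -1 selects the default (buffered) mode, 0 disables buffering entirely
for buf in (-1, 0):
    f = open(filename, 'r+b', buffering=buf)
    f.seek(209)
    t1 = time.perf_counter()
    f.seek(f.tell() + 1200000000)
    t2 = time.perf_counter()
    f.readline()
    f.close()
    print(f'buffering={buf}: seek took {(t2 - t1) * 1000:.3f} ms')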
answered Nov 14 '22 by wovano