I have a few text files ranging in size from 5 GB to 50 GB, and I am using Python to read them. I have specific anchors, expressed as byte offsets, to which I can seek and read the corresponding data from each of these files (using Python's file API).
The issue is that for relatively small files (< 5 GB) this approach works well. However, for much larger files (> 20 GB), and especially when file.seek has to make longer jumps (several million bytes at a time), the call sometimes takes a few hundred milliseconds.
My impression was that seek operations within a file are constant-time operations, but apparently they are not. Is there a way around this?
Here is what I am doing:
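import time
# filename is the path to one of the large files
f = open(filename, 'r+b')
f.seek(209)
current = f.tell()
t1 = time.time()
new_pos = f.seek(current + 1200000000)  # jump ahead by 1.2 billion bytes
t2 = time.time()
line = f.readline()
delta = t2 - t1  # time spent in the seek call, in seconds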
The delta variable varies intermittently between a few microseconds and a few hundred milliseconds. I also profiled the CPU usage and did not see anything busy there either.
Your code runs consistently in under 10 microseconds on my system (Windows 10, Python 3.7), so there is no obvious error in your code.
NB: You should use time.perf_counter() instead of time.time() for measuring performance. The granularity of time.time() can be very bad ("not all systems provide time with a better precision than 1 second"). When comparing timings with other systems you may get strange results.
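As a rough sketch of that advice, the seek can be timed with time.perf_counter() in a small helper. The name time_seek is hypothetical and only for illustration; it works on any seekable binary file object.

```python
import time

def time_seek(f, offset):
    """Return (new_position, elapsed_seconds) for one absolute seek.

    time_seek is a hypothetical helper name, not part of any library.
    """
    start = time.perf_counter()
    pos = f.seek(offset)  # seek() returns the new absolute position
    elapsed = time.perf_counter() - start
    return pos, elapsed
```

Calling it repeatedly with the same offsets and looking at the spread of the timings, rather than at a single measurement, should make intermittent slow seeks easier to spot.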
My best guess is that the seek triggers some buffering (read-ahead) action, which might be slow, depending on your system.
According to the documentation:
Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
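To see which values apply on a given machine, something like the following could be used. It is only a sketch: describe_buffering is a made-up helper name, and st_blksize is not reported on every platform (notably Windows), so it is read defensively.

```python
import io
import os

def describe_buffering(path):
    """Report the values that influence the default buffer size for a file.

    describe_buffering is a hypothetical helper, for illustration only.
    """
    with open(path, 'rb') as f:
        stat_result = os.fstat(f.fileno())
        return {
            # Fallback size used when no block size can be determined
            'default_buffer_size': io.DEFAULT_BUFFER_SIZE,
            # Block size reported by the OS, or None if unavailable
            'st_blksize': getattr(stat_result, 'st_blksize', None),
            # With default buffering, 'rb' yields a BufferedReader
            'file_object_type': type(f).__name__,
        }
```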
You could try to disable buffering by adding the argument buffering=0 to open() and check if that makes a difference:
open(filename, 'r+b', buffering=0)
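One way to check this is to time the same seeks with and without buffering. The sketch below assumes read-only access ('rb' rather than 'r+b') and uses a hypothetical helper name, compare_seek; the offsets are whatever anchors you already have.

```python
import time

def compare_seek(path, offsets):
    """Time identical absolute seeks with default buffering and buffering=0.

    compare_seek is a hypothetical helper; returns a dict mapping
    'buffered' / 'unbuffered' to a list of elapsed seconds per offset.
    """
    results = {}
    # buffering=-1 is the default policy; buffering=0 disables buffering
    for label, buffering in (('buffered', -1), ('unbuffered', 0)):
        with open(path, 'rb', buffering=buffering) as f:
            timings = []
            for offset in offsets:
                start = time.perf_counter()
                f.seek(offset)
                timings.append(time.perf_counter() - start)
            results[label] = timings
    return results
```

Note that with buffering=0 the file object is a raw io.FileIO, so a subsequent readline() may be slower than on a buffered file. This comparison only isolates the cost of the seek itself.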