Let's say that I routinely have to work with files with an unknown, but large, number of lines. Each line contains a set of integers (delimited by spaces, commas, semicolons, or some other non-numeric character) in the closed interval [0, R], where R can be arbitrarily large. The number of integers on each line can be variable. Oftentimes I get the same number of integers on each line, but occasionally I have lines with unequal sets of numbers.
Suppose I want to go to the Nth line in the file and retrieve the Kth number on that line (and assume that the inputs N and K are valid --- that is, I am not worried about bad inputs). How do I go about doing this efficiently in Python 3.1.2 for Windows?
I do not want to traverse the file line by line.
I tried using mmap, but while poking around here on SO, I learned that that's probably not the best solution on a 32-bit build because of the 4GB limit. And in truth, I couldn't really figure out how to simply move N lines away from my current position. If I can at least just "jump" to the Nth line then I can use .split() and grab the Kth integer that way.
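(For reference, once I have the Nth line in hand as a string, pulling out the Kth number is the easy part; something roughly like this is what I have in mind, using re.findall instead of a plain .split() since the delimiter varies:)

import re

def kth_number(line, k):
    # pull out every run of digits, whatever separates them
    numbers = re.findall(r'\d+', line)
    return int(numbers[k - 1])   # K is 1-based here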
The nuance here is that I don't just need to grab one line from the file. I will need to grab several lines: they are not necessarily all near each other, the order in which I get them matters, and the order is not always based on some deterministic function.
Any ideas? I hope this is enough information.
Thanks!
Python's seek goes to a byte offset in a file, not to a line offset, simply because that's the way modern operating systems and their filesystems work -- the OS/FS just don't record or remember "line offsets" in any way whatsoever, and there's no way for Python (or any other language) to just magically guess them. Any operation purporting to "go to a line" will inevitably need to "walk through the file" (under the covers) to make the association between line numbers and byte offsets.
If you're OK with that and just want it hidden from your sight, then the solution is the standard library module linecache -- but performance won't be any better than that of code you could write yourself.
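For instance, a minimal sketch of that approach (the function name and the 1-based N/K convention are just for illustration) might be:

import linecache
import re

def get_kth_number(filename, n, k):
    # getline is 1-based for line numbers; on first access it reads
    # and caches the entire file under the covers
    line = linecache.getline(filename, n)
    return int(re.findall(r'\d+', line)[k - 1])   # K is 1-based too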
If you need to read from the same large file multiple times, a large optimization would be to run, once, a script on that large file that builds and saves to disk the line-number-to-byte-offset correspondence (technically an auxiliary "index" file); then all your successive runs (until the large file changes) could use the index file to navigate through the large file with very high performance. Is this your use case...?
Edit: since apparently this may apply -- here's the general idea (net of careful testing, error checking, or optimization;-). To make the index, use makeindex.py, as follows:
import array
import sys

BLOCKSIZE = 1024 * 1024

def reader(f):
    # yield the byte offset of every b'\n' in the file, reading block by block
    blockstart = 0
    while True:
        block = f.read(BLOCKSIZE)
        if not block: break
        inblock = 0
        while True:
            nextnl = block.find(b'\n', inblock)
            if nextnl < 0:
                blockstart += len(block)
                break
            yield nextnl + blockstart
            inblock = nextnl + 1

def doindex(fn):
    with open(fn, 'rb') as f:
        # result format: x[0] is tot # of lines,
        # x[N] is byte offset of END of line N (1+)
        result = array.array('L', [0])
        result.extend(reader(f))
        result[0] = len(result) - 1
        return result

def main():
    for fn in sys.argv[1:]:
        index = doindex(fn)
        with open(fn + '.indx', 'wb') as p:
            print('File', fn, 'has', index[0], 'lines')
            index.tofile(p)

main()
and then to use it, for example, the following useindex.py:
import array
import sys

def readline(n, f, findex):
    # n is 0-based here: line n starts one byte past findex[n] (the offset
    # of the newline ending the previous line) and ends at findex[n+1]
    f.seek(findex[n] + 1)
    bytes = f.read(findex[n+1] - findex[n])
    return bytes.decode('utf8')

def main():
    fn = sys.argv[1]
    with open(fn + '.indx', 'rb') as f:
        # read back with a signed typecode (same item size as the 'L' used
        # when writing) so that findex[0] can be set to -1
        findex = array.array('l')
        findex.fromfile(f, 1)
        findex.fromfile(f, findex[0])
        findex[0] = -1
    with open(fn, 'rb') as f:
        for n in sys.argv[2:]:
            print(n, repr(readline(int(n), f, findex)))

main()
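And, to tie this back to the original question, getting the Kth number on the Nth line on top of this index could look roughly like the following sketch (kth_number_on_line and its 1-based N/K convention are mine, not part of the scripts above; it assumes readline and findex as defined in useindex.py):

import re

def kth_number_on_line(n, k, f, findex):
    # N and K are 1-based here; readline() above is 0-based, hence n - 1
    line = readline(n - 1, f, findex)
    return int(re.findall(r'\d+', line)[k - 1])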
Here's an example (on my slow laptop):
$ time py3 makeindex.py kjv10.txt
File kjv10.txt has 100117 lines
real 0m0.235s
user 0m0.184s
sys 0m0.035s
$ time py3 useindex.py kjv10.txt 12345 98765 33448
12345 '\r\n'
98765 '2:6 But this thou hast, that thou hatest the deeds of the\r\n'
33448 'the priest appointed officers over the house of the LORD.\r\n'
real 0m0.049s
user 0m0.028s
sys 0m0.020s
$
The sample file is a plain-text copy of the King James Bible:
$ wc kjv10.txt
100117 823156 4445260 kjv10.txt
100K lines, 4.4 MB, as you can see; this takes about a quarter second to index and 50 milliseconds to read and print out three randomly chosen lines (no doubt this can be vastly accelerated with more careful optimization and a better machine). The index in memory (and on disk too) takes 4 bytes per line of the text file being indexed, and performance should scale in a perfectly linear way, so if you had about 100 million lines, 4.4 GB, I would expect about 4-5 minutes to build the index and a minute to extract and print out three arbitrary lines (and the 400 MB of RAM taken by the index should not inconvenience even a small machine -- even my tiny slow laptop has 2 GB, after all;-).
You can also see that (for speed and convenience) I treat the file as binary (and assume utf8 encoding -- which of course also works with any subset such as ASCII; e.g., that KJ text file is ASCII) and don't bother collapsing \r\n into a single character if that's what the file has as its line terminator (it's pretty trivial to do that after reading each line if you want).
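For instance, a trivial tweak (just a sketch, applied to the result of readline above) would be:

line = readline(n, f, findex).replace('\r\n', '\n')   # collapse Windows line endings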
The problem is that since your lines are not of fixed length, you have to pay attention to line end markers to do your seeking, and that effectively becomes "traversing the file line by line". Thus, any viable approach is still going to be traversing the file, it's merely a matter of what can traverse it fastest.
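If the lines are only needed once (so that building an index file isn't worth the trouble), one reasonable shape for that traversal is a single pass that collects every requested line as it goes by; here's a rough sketch (the function name and the use of a set of wanted 1-based line numbers are my own choices, not anything from the question):

def grab_lines(filename, wanted):
    # wanted: a set of 1-based line numbers; returns {line_number: line}
    found = {}
    with open(filename, 'r') as f:
        for lineno, line in enumerate(f, 1):
            if lineno in wanted:
                found[lineno] = line
                if len(found) == len(wanted):
                    break   # stop as soon as the last wanted line is seen
    return found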