I have an ASCII table in a file from which I want to read a particular set of lines (e.g. lines 4003 to 4005). The issue is that this file could be very, very long (hundreds of thousands to millions of lines), and I'd like to do this as quickly as possible.
a) Bad Solution: Read in the entire file, and go to those lines:
f = open('filename')
lines = f.readlines()[4003:4005]
b) Better Solution: enumerate over each line so that it's not all in memory (a la https://stackoverflow.com/a/2081880/230468):
f = open('filename')
lines = []
for i, line in enumerate(f):
    if i >= 4003 and i <= 4005: lines.append(line)
    if i > 4005: break  # @Wooble
c) Best Solution?
But b) still requires going through each line.
Is there a better (in terms of speed/efficiency) method of accessing a particular line from a huge file?
I would probably just use itertools.islice. Using islice over an iterable like a file handle means the whole file is never read into memory, and the first 4002 lines are discarded as quickly as possible. You could even cast the two lines you need into a list pretty cheaply (assuming the lines themselves aren't very long). Then you can exit the with block, closing the filehandle.
from itertools import islice

with open('afile') as f:
    # islice is zero-indexed and end-exclusive, like list slicing
    lines = list(islice(f, 4003, 4005))

do_something_with(lines)
But holy cow is linecache faster for multiple accesses. I created a million-line file to compare islice and linecache, and linecache blew it away.
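For reference, here's a minimal sketch of how such a test file might be built (my reconstruction, not the exact script used): each zero-indexed line contains its own line number, which is why the islice results below run one ahead of linecache's one-based getline. The timings also assume from timeit import timeit.

with open('afile', 'w') as f:
    for i in range(10**6):
        f.write('%d\n' % i)  # line i (zero-indexed) holds the number i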
>>> timeit("x=islice(open('afile'), 4003, 4005); print next(x) + next(x)", 'from itertools import islice', number=1)
4003
4004
0.00028586387634277344
>>> timeit("print getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=1)
4002
4003
2.193450927734375e-05
>>> timeit("getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=10**5)
0.14125394821166992
>>> timeit("''.join(islice(open('afile'), 4003, 4005))", 'from itertools import islice', number=10**5)
14.732316970825195
This is not a practical test, but even re-importing linecache at each step, it's only a second slower than islice.
>>> timeit("from linecache import getline; getline('afile', 4003) + getline('afile', 4004)", number=10**5)
15.613967180252075
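The reason linecache dominates on repeated accesses is that it caches the file's contents in memory after the first read; here's a quick sketch of that behavior (clearcache is part of the standard linecache module):

from linecache import getline, clearcache

getline('afile', 4003)  # first call reads and caches the entire file
getline('afile', 4004)  # served straight from the in-memory cache
clearcache()            # drops the cache; the next getline re-reads the file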
Yes, linecache is faster than islice for all but constantly re-creating the linecache, but who does that? For the likely scenarios (reading only a few lines, once, and reading many lines, once), linecache is faster and presents a terse syntax, but the islice syntax is quite clean and fast as well and never reads the whole file into memory. In a RAM-tight environment, the islice solution may be the right choice. For very high speed requirements, linecache may be the better choice. Practically, though, in most environments both times are small enough it almost doesn't matter.
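To make the trade-off concrete, here's a minimal sketch wrapping both approaches behind the same interface (the helper names are mine, purely illustrative):

from itertools import islice
from linecache import getline

def lines_via_islice(path, start, stop):
    # Zero-indexed, end-exclusive, like list slicing; streams the file,
    # so memory use stays flat no matter how large the file is.
    with open(path) as f:
        return list(islice(f, start, stop))

def lines_via_linecache(path, start, stop):
    # One-indexed, end-inclusive; linecache caches the whole file in
    # memory, which is what makes repeated accesses so fast.
    return [getline(path, n) for n in range(start, stop + 1)]

On the test file above, lines_via_islice('afile', 4003, 4005) and lines_via_linecache('afile', 4004, 4005) return the same two lines; the off-by-one between the two calls is the same indexing difference visible in the timeit output.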