I have an ASCII table in a file from which I want to read a particular set of lines (e.g. lines 4003 to 4005). The issue is that this file could be very, very long (hundreds of thousands to millions of lines), and I'd like to do this as quickly as possible.
a) Bad Solution: Read in the entire file, and go to those lines:
f = open('filename')
lines = f.readlines()[4003:4005]
b) Better Solution: enumerate over each line so that it's not all in memory (a la https://stackoverflow.com/a/2081880/230468):
f = open('filename')
lines = []
for i, line in enumerate(f):
    if i >= 4003 and i <= 4005: lines.append(line)
    if i > 4005: break  # @Wooble
c) Best Solution?
But b) still requires going through each line.
Is there a better (in terms of speed/efficiency) method of accessing a particular line from a huge file?
I would probably just use itertools.islice. Using islice over an iterable like a file handle means the whole file is never read into memory, and the first 4002 lines are discarded as quickly as possible. You could even cast the two lines you need into a list pretty cheaply (assuming the lines themselves aren't very long). Then you can exit the with block, closing the filehandle.
from itertools import islice

with open('afile') as f:
    # islice is zero-indexed and end-exclusive, like list slicing
    lines = list(islice(f, 4003, 4005))

do_something_with(lines)
But holy cow is linecache faster for multiple accesses. I created a million-line file to compare islice and linecache, and linecache blew it away.
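For reference, here's a minimal sketch of how such a test file might be built (my reconstruction, not the exact script used): each zero-indexed line contains its own line number, which is why the islice results below run one ahead of linecache's one-based getline. The timings also assume from timeit import timeit.

with open('afile', 'w') as f:
    for i in range(10**6):
        f.write('%d\n' % i)  # line i (zero-indexed) holds the number i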
>>> timeit("x=islice(open('afile'), 4003, 4005); print next(x) + next(x)", 'from itertools import islice', number=1)
4003
4004
0.00028586387634277344
>>> timeit("print getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=1)
4002
4003
2.193450927734375e-05
>>> timeit("getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=10**5)
0.14125394821166992
>>> timeit("''.join(islice(open('afile'), 4003, 4005))", 'from itertools import islice', number=10**5)
14.732316970825195
This is not a practical test, but even re-importing linecache at each step, it's only a second slower than islice.
>>> timeit("from linecache import getline; getline('afile', 4003) + getline('afile', 4004)", number=10**5)
15.613967180252075
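The reason linecache dominates on repeated accesses is that it caches the file's contents in memory after the first read; here's a quick sketch of that behavior (clearcache is part of the standard linecache module):

from linecache import getline, clearcache

getline('afile', 4003)  # first call reads and caches the entire file
getline('afile', 4004)  # served straight from the in-memory cache
clearcache()            # drops the cache; the next getline re-reads the file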
Yes, linecache is faster than islice for all but constantly re-creating the linecache, but who does that? For the likely scenarios (reading only a few lines, once, and reading many lines, once), linecache is faster and presents a terse syntax, but the islice syntax is quite clean and fast as well and never reads the whole file into memory. In a RAM-tight environment, the islice solution may be the right choice. For very high speed requirements, linecache may be the better choice. Practically, though, in most environments both times are small enough it almost doesn't matter.
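To make the trade-off concrete, here's a minimal sketch wrapping both approaches behind the same interface (the helper names are mine, purely illustrative):

from itertools import islice
from linecache import getline

def lines_via_islice(path, start, stop):
    # Zero-indexed, end-exclusive, like list slicing; streams the file,
    # so memory use stays flat no matter how large the file is.
    with open(path) as f:
        return list(islice(f, start, stop))

def lines_via_linecache(path, start, stop):
    # One-indexed, end-inclusive; linecache caches the whole file in
    # memory, which is what makes repeated accesses so fast.
    return [getline(path, n) for n in range(start, stop + 1)]

On the test file above, lines_via_islice('afile', 4003, 4005) and lines_via_linecache('afile', 4004, 4005) return the same two lines; the off-by-one between the two calls is the same indexing difference visible in the timeit output.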