I'm reading through a large file, and processing it. I want to be able to jump to the middle of the file without it taking a long time.
right now I am doing:
f = gzip.open(input_name)
for i in range(1000000):
f.read() # just skipping the first 1M rows
for line in f:
do_something(line)
is there a faster way to skip the lines in the zipped file? If I have to unzip it first, I'll do that, but there has to be a way.
It's of course a text file, with \n
separating lines.
The nature of gzipping is such that there is no longer the concept of lines when the file is compressed -- it's just a binary blob. Check out this for an explanation of what gzip does.
To read the file, you'll need to decompress it -- the gzip
module does a fine job of it. Like other answers, I'd also recommend itertools
to do the jumping, as it will carefully make sure you don't pull things into memory, and it will get you there as fast as possible.
with gzip.open(filename) as f:
# jumps to `initial_row`
for line in itertools.slice(f, initial_row, None):
# have a party
Alternatively, if this is a CSV that you're going to be working with, you could also try clocking pandas
parsing, as it can handle decompressing gzip
. That would look like: parsed_csv = pd.read_csv(filename, compression='gzip')
.
Also, to be extra clear, when you iterate over file objects in python -- i.e. like the f
variable above -- you iterate over lines. You do not need to think about the '\n' characters.
You can use itertools.islice, passing a file object f
and starting point, it will still advance the iterator but more efficiently than calling next 1000000 times:
from itertools import islice
for line in islice(f,1000000,None):
print(line)
Not overly familiar with gzip but I imagine f.read()
reads the whole file so the next 999999 calls are doing nothing. If you wanted to manually advance the iterator you would call next on the file object i.e next(f)
.
Calling next(f)
won't mean all the lines are read into memory at once either, it advances the iterator one line at a time so if you want to skip a line or two in a file or a header it can be useful.
The consume recipe as @wwii suggested recipe is also worth checking out
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With