
python jump to a line in a txt file (a gzipped one)

Tags:

python

file-io

I'm reading through a large file, and processing it. I want to be able to jump to the middle of the file without it taking a long time.

Right now I am doing:

import gzip

f = gzip.open(input_name)
for i in range(1000000):
    f.read()  # just skipping the first 1M rows

for line in f:
    do_something(line)

Is there a faster way to skip the lines in the zipped file? If I have to unzip it first, I'll do that, but there has to be a way.

It's of course a text file, with \n separating lines.

eran asked Apr 19 '15

2 Answers

The nature of gzipping is such that there is no longer the concept of lines when the file is compressed -- it's just a binary blob. Check out this for an explanation of what gzip does.

To read the file, you'll need to decompress it -- the gzip module does a fine job of it. Like other answers, I'd also recommend itertools to do the jumping, as it will carefully make sure you don't pull things into memory, and it will get you there as fast as possible.

import itertools

with gzip.open(filename) as f:
    # jumps to `initial_row`
    for line in itertools.islice(f, initial_row, None):
        do_something(line)  # have a party

Alternatively, if this is a CSV that you're going to be working with, you could also try timing pandas' parser, as it can handle decompressing gzip. That would look like: parsed_csv = pd.read_csv(filename, compression='gzip').
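Pandas can also do the skipping itself via its skiprows parameter, so you land mid-file in a single call. A minimal sketch, using a small throwaway gzipped CSV (the file contents and row counts here are invented for illustration):

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a small gzipped CSV as a stand-in for the real file.
fd, path = tempfile.mkstemp(suffix=".csv.gz")
os.close(fd)
with gzip.open(path, "wt") as f:
    for i in range(10):
        f.write(f"{i},row{i}\n")

# skiprows tells pandas to discard the first N lines while parsing,
# so only the remaining rows are loaded.
df = pd.read_csv(path, compression="gzip", skiprows=5, header=None)
print(len(df))  # 5 rows remain out of 10
```

Note that pandas still has to decompress and scan past the skipped rows; it just does so in fast C code rather than in a Python loop.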

Also, to be extra clear, when you iterate over file objects in Python -- i.e. like the f variable above -- you iterate over lines. You do not need to think about the '\n' characters.
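A quick self-contained demonstration of that line-by-line iteration (the file contents here are made up for illustration):

```python
import gzip
import os
import tempfile

# Write a small gzipped text file.
fd, path = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(path, "wt") as f:
    f.write("alpha\nbeta\ngamma\n")

# Iterating the file object yields one line at a time; each line keeps
# its trailing '\n', but you never have to split the stream yourself.
with gzip.open(path, "rt") as f:
    lines = [line for line in f]

print(lines)  # ['alpha\n', 'beta\n', 'gamma\n']
```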

jwilner answered Oct 01 '22

You can use itertools.islice, passing it the file object f and a starting point. It will still advance the iterator, but more efficiently than calling next 1000000 times:

from itertools import islice

for line in islice(f, 1000000, None):
    print(line)

Not overly familiar with gzip, but I imagine f.read() reads the whole file, so the next 999999 calls are doing nothing. If you wanted to manually advance the iterator, you would call next on the file object, i.e. next(f).

Calling next(f) won't mean all the lines are read into memory at once either; it advances the iterator one line at a time, so it can be useful if you want to skip a line or two, or a header.
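A minimal sketch of skipping a header line with next(f) on a gzipped file (the sample data is invented for illustration):

```python
import gzip
import os
import tempfile

# Write a tiny gzipped file with a header line followed by data.
fd, path = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(path, "wt") as f:
    f.write("header\n1\n2\n")

with gzip.open(path, "rt") as f:
    next(f)  # advances past the header line only, nothing else is read
    rest = [line.strip() for line in f]

print(rest)  # ['1', '2']
```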

The consume recipe that @wwii suggested is also worth checking out.
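For reference, the consume recipe from the itertools documentation looks like this; it advances an iterator n steps (or exhausts it) at C speed by feeding it into a zero-length deque:

```python
from collections import deque
from itertools import islice

def consume(iterator, n=None):
    """Advance the iterator n steps ahead; if n is None, consume entirely."""
    if n is None:
        # A maxlen=0 deque discards everything it is fed, so this
        # drains the iterator without storing any items.
        deque(iterator, maxlen=0)
    else:
        # islice(iterator, n, n) consumes n items and yields none.
        next(islice(iterator, n, n), None)

it = iter(range(10))
consume(it, 3)        # skip the first three items
skipped_to = next(it)
print(skipped_to)  # 3
```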

Padraic Cunningham answered Oct 01 '22