
python jump to a line in a txt file (a gzipped one)

Tags:

python

file-io

I'm reading through a large file, and processing it. I want to be able to jump to the middle of the file without it taking a long time.

Right now I am doing:

import gzip

f = gzip.open(input_name)
for i in range(1000000):
    f.read()  # just skipping the first 1M rows

for line in f:
    do_something(line)

Is there a faster way to skip the lines in the zipped file? If I have to unzip it first, I'll do that, but there has to be a way.

It's of course a text file, with \n separating lines.

eran asked Apr 19 '15

2 Answers

The nature of gzipping is such that there is no longer the concept of lines when the file is compressed -- it's just a binary blob. Check out this for an explanation of what gzip does.

To read the file, you'll need to decompress it -- the gzip module does a fine job of it. Like other answers, I'd also recommend itertools to do the jumping, as it will carefully make sure you don't pull things into memory, and it will get you there as fast as possible.

import itertools

with gzip.open(filename) as f:
    # jumps to `initial_row`
    for line in itertools.islice(f, initial_row, None):
        do_something(line)  # have a party

Alternatively, if this is a CSV that you're going to be working with, you could also try timing pandas' parser, as it can handle decompressing gzip. That would look like: parsed_csv = pd.read_csv(filename, compression='gzip').
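Pandas can also do the skipping itself via its skiprows parameter, so you land mid-file in a single call. A minimal sketch, using a small throwaway gzipped CSV (the file contents and row counts here are invented for illustration):

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a small gzipped CSV as a stand-in for the real file.
fd, path = tempfile.mkstemp(suffix=".csv.gz")
os.close(fd)
with gzip.open(path, "wt") as f:
    for i in range(10):
        f.write(f"{i},row{i}\n")

# skiprows tells pandas to discard the first N lines while parsing,
# so only the remaining rows are loaded.
df = pd.read_csv(path, compression="gzip", skiprows=5, header=None)
print(len(df))  # 5 rows remain out of 10
```

Note that pandas still has to decompress and scan past the skipped rows; it just does so in fast C code rather than in a Python loop.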

Also, to be extra clear, when you iterate over file objects in Python -- i.e. like the f variable above -- you iterate over lines. You do not need to think about the '\n' characters.
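A quick self-contained demonstration of that line-by-line iteration (the file contents here are made up for illustration):

```python
import gzip
import os
import tempfile

# Write a small gzipped text file.
fd, path = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(path, "wt") as f:
    f.write("alpha\nbeta\ngamma\n")

# Iterating the file object yields one line at a time; each line keeps
# its trailing '\n', but you never have to split the stream yourself.
with gzip.open(path, "rt") as f:
    lines = [line for line in f]

print(lines)  # ['alpha\n', 'beta\n', 'gamma\n']
```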

jwilner answered Oct 01 '22

You can use itertools.islice, passing it the file object f and a starting point. It will still advance the iterator, but more efficiently than calling next 1000000 times:

from itertools import islice

for line in islice(f, 1000000, None):
    print(line)

Not overly familiar with gzip, but I imagine f.read() reads the whole file, so the next 999999 calls are doing nothing. If you wanted to manually advance the iterator, you would call next on the file object, i.e. next(f).

Calling next(f) won't mean all the lines are read into memory at once either; it advances the iterator one line at a time, so it can be useful if you want to skip a line or two, or a header.
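A minimal sketch of skipping a header line with next(f) on a gzipped file (the sample data is invented for illustration):

```python
import gzip
import os
import tempfile

# Write a tiny gzipped file with a header line followed by data.
fd, path = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(path, "wt") as f:
    f.write("header\n1\n2\n")

with gzip.open(path, "rt") as f:
    next(f)  # advances past the header line only, nothing else is read
    rest = [line.strip() for line in f]

print(rest)  # ['1', '2']
```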

The consume recipe that @wwii suggested is also worth checking out.
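For reference, the consume recipe from the itertools documentation looks like this; it advances an iterator n steps (or exhausts it) at C speed by feeding it into a zero-length deque:

```python
from collections import deque
from itertools import islice

def consume(iterator, n=None):
    """Advance the iterator n steps ahead; if n is None, consume entirely."""
    if n is None:
        # A maxlen=0 deque discards everything it is fed, so this
        # drains the iterator without storing any items.
        deque(iterator, maxlen=0)
    else:
        # islice(iterator, n, n) consumes n items and yields none.
        next(islice(iterator, n, n), None)

it = iter(range(10))
consume(it, 3)        # skip the first three items
skipped_to = next(it)
print(skipped_to)  # 3
```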

Padraic Cunningham answered Oct 01 '22