So I have some fairly gigantic .gz files - we're talking 10 to 20 GB each when decompressed.
I need to loop through each line of them, so I'm using the standard:
import gzip
f = gzip.open(path + myFile, 'r')
for line in f.readlines():
    # (yadda yadda)
    pass
f.close()
However, both the open() and close() commands take AGES, using up 98% of the memory+CPU. So much so that the program exits and prints Killed to the terminal. Maybe it is loading the entire extracted file into memory?
I'm now using something like:
from subprocess import call

# gunzip to a temporary file on disk first
out = open(path + 'myfile.txt', 'w')
call(['gunzip', '-c', path + myFile], stdout=out)
out.close()

f = open(path + 'myfile.txt', 'r')
# do some looping through the extracted file
f.close()
# then delete the extracted file
This works. But is there a cleaner way?
I'm 99% sure that your problem is not in the gzip.open(), but in the readlines().
As the documentation explains:
f.readlines() returns a list containing all the lines of data in the file.
Obviously, that requires reading and decompressing the entire file, and building up an absolutely gigantic list.
Most likely, it's actually the malloc calls to allocate all that memory that are taking forever. And then, at the end of this scope (assuming you're using CPython), it has to GC that whole gigantic list, which will also take forever.
You almost never want to use readlines. Unless you're using a very old Python, just do this:
for line in f:
A file is an iterable full of lines, just like the list returned by readlines, except that it's not actually a list: it generates more lines on the fly by reading out of a buffer. So, at any given time, you'll only have one line and a couple of buffers on the order of 10 MB each, instead of a 25 GB list. And the reading and decompressing will be spread out over the lifetime of the loop, instead of done all at once.
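Concretely, a minimal sketch of that streaming pattern (the path here is just a placeholder, not from the question):

import gzip

# Stream the compressed file one line at a time; only the current line
# and gzip's internal read buffer are held in memory.
with gzip.open('/path/to/huge_file.gz', 'rt') as f:  # 'rt' = text mode on Python 3; plain 'r' on Python 2
    for line in f:
        pass  # process the line here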
From a quick test with a 3.5GB gzip file, gzip.open() is effectively instant, for line in f: pass takes a few seconds, and f.close() is effectively instant. But if I do for line in f.readlines(): pass, it takes… well, I'm not sure how long, because after about a minute my system went into swap-thrashing hell and I had to force-kill the interpreter to get it to respond to anything.
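For reference, a rough reconstruction of that kind of quick test (the file path and the timing boilerplate are mine, not from the original measurement):

import gzip
import time

path = '/path/to/big_file.gz'  # placeholder: any multi-GB .gz file

start = time.time()
f = gzip.open(path, 'rt')
print('open:  %.2fs' % (time.time() - start))   # effectively instant

start = time.time()
for line in f:
    pass
print('loop:  %.2fs' % (time.time() - start))   # seconds to minutes, scales with file size

start = time.time()
f.close()
print('close: %.2fs' % (time.time() - start))   # effectively instant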
Since this has come up a dozen more times since this answer, I wrote this blog post which explains a bit more.
Have a look at pandas, in particular its IO tools. They support gzip compression when reading files, and you can read files in chunks. Besides, pandas is very fast and memory efficient.
As I've never tried it, I don't know how well the compression and chunked reading work together, but it might be worth a try; a rough sketch is below.
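A minimal sketch of chunked reading from a gzipped CSV (the file path, chunk size, and column handling are placeholders):

import pandas as pd

# pandas infers compression from the .gz extension, or you can pass
# compression='gzip' explicitly; chunksize makes read_csv return an
# iterator of DataFrames instead of loading everything at once.
reader = pd.read_csv('/path/to/huge_file.csv.gz',
                     compression='gzip',
                     chunksize=100000)

for chunk in reader:
    # each chunk is an ordinary DataFrame of up to 100000 rows
    pass  # process the chunk here

Each chunk only holds chunksize rows in memory, so the full decompressed file never has to fit at once.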