I'm having a problem processing a largish file in Python. All I'm doing is:

import gzip

counter = 0
f = gzip.open(pathToLog, 'r')  # pathToLog points at the gzipped log file
for line in f:
    counter = counter + 1
    if (counter % 1000000 == 0):
        print counter
f.close()
This takes around 10m25s just to open the file, read the lines, and increment the counter.
In Perl, dealing with the same file and doing quite a bit more (some regular expression work), the whole process takes around 1m17s.
Perl Code:
open(LOG, "/bin/zcat $logfile |") or die "Cannot read $logfile: $!\n";
while (<LOG>) {
    if (m/.*\[svc-\w+\].*login result: Successful\.$/) {
        $_ =~ s/some regex here/$1,$2,$3,$4/;
        push @an_array, $_;
    }
}
close LOG;
Can anyone advise what I can do to make the Python solution run at a similar speed to the Perl solution?
EDIT I've tried just uncompressing the file first and reading it with plain open instead of gzip.open, but that only brings the total time down to around 4m14.972s, which is still too slow.
I also removed the modulo and print statements and replaced them with pass, so all that is being done now is iterating from line to line through the file.
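The stripped-down version looks roughly like this (a sketch of what's described above; uncompressedLogPath is just a stand-in name for the gunzipped file):

# Sketch of the stripped-down test: plain open, no counting, no printing --
# just iterating over every line of the uncompressed file.
f = open(uncompressedLogPath, 'r')  # uncompressedLogPath: stand-in name
for line in f:
    pass
f.close()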
In Python (at least <= 2.6.x), gzip format parsing is implemented in Python (on top of zlib). Moreover, it appears to do some strange things, namely decompressing to the end of the file into memory and then discarding everything beyond the requested read size (then doing it again for the next read). DISCLAIMER: I've only looked at gzip.read() for 3 minutes, so I could be wrong here. Regardless of whether my understanding of gzip.read() is correct, the gzip module does not appear to be optimized for large data volumes. Try doing the same thing as in Perl, i.e. launching an external process (e.g. via the subprocess module).
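For example, something along these lines (a minimal sketch, assuming zcat is on the PATH and pathToLog is the same variable as in the question; the pipe is read line by line rather than slurped into memory):

from subprocess import Popen, PIPE

# Minimal sketch: let an external zcat do the decompression, as in the
# Perl version, and iterate over its stdout line by line.
counter = 0
p = Popen(["zcat", pathToLog], stdout=PIPE)
for line in p.stdout:
    counter = counter + 1
    if counter % 1000000 == 0:
        print counter
p.stdout.close()
p.wait()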
EDIT Actually, I missed the OP's remark about plain file I/O being just as slow as the compressed read (thanks to ire_and_curses for pointing it out). This struck me as unlikely, so I did some measurements...
from timeit import Timer
import gzip

def w(n):
    L = "*" * 80 + "\n"
    with open("ttt", "w") as f:
        for i in xrange(n):
            f.write(L)

def r():
    with open("ttt", "r") as f:
        for n, line in enumerate(f):
            if n % 1000000 == 0:
                print n

def g():
    f = gzip.open("ttt.gz", "r")
    for n, line in enumerate(f):
        if n % 1000000 == 0:
            print n
Now, running it...
>>> Timer("w(10000000)", "from __main__ import w").timeit(1)
14.153118133544922
>>> Timer("r()", "from __main__ import r").timeit(1)
1.6482770442962646
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)
...and after having a tea break and discovering that it was still running, I killed it, sorry. Then I tried 100,000 lines instead of 10,000,000:
>>> Timer("w(100000)", "from __main__ import w").timeit(1)
0.05810999870300293
>>> Timer("r()", "from __main__ import r").timeit(1)
0.09662318229675293
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)
11.939290046691895
The gzip module's time is O(file_size**2), so with line counts on the order of millions, gzip read time simply cannot match plain read time (as the experiment confirms). Anonymouslemming, please check again.
If you Google "why is python gzip slow" you'll find plenty of discussion of this, including patches for improvements in Python 2.7 and 3.2. In the meantime, use zcat as you did in Perl; it is wicked fast. Your (first) function takes me about 4.19s on a 5MB compressed file, and the second function takes 0.78s. However, I don't know what's going on with your uncompressed files: if I uncompress the log files (apache logs) and run the two functions on them, using a plain Python open(file) and Popen('cat'), Python is faster (0.17s) than cat (0.48s).
#!/usr/bin/python

import gzip
from subprocess import PIPE, Popen
import sys
import timeit

#pathToLog = 'big.log.gz' # 50M compressed (*10 uncompressed)
pathToLog = 'small.log.gz' # 5M ""

def test_ori():
    counter = 0
    f = gzip.open(pathToLog, 'r')
    for line in f:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line
    f.close()

def test_new():
    counter = 0
    content = Popen(["zcat", pathToLog], stdout=PIPE).communicate()[0].split('\n')
    for line in content:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line

if '__main__' == __name__:
    to = timeit.Timer('test_ori()', 'from __main__ import test_ori')
    print "Original function time", to.timeit(1)
    tn = timeit.Timer('test_new()', 'from __main__ import test_new')
    print "New function time", tn.timeit(1)
I spent a while on this. Hopefully this code will do the trick. It uses zlib and no external calls.
The gunzipchunks function reads the compressed gzip file in chunks that can be iterated over (a generator).
The gunziplines function consumes those uncompressed chunks and yields one line at a time, which can also be iterated over (another generator).
Finally, the gunziplinescounter function gives you what you're looking for.
Cheers!
import zlib

file_name = 'big.txt.gz'
#file_name = 'mini.txt.gz'

#for i in gunzipchunks(file_name): print i
def gunzipchunks(file_name, chunk_size=4096):
    # Incrementally decompress the gzip file, yielding one decompressed
    # chunk at a time (16 + MAX_WBITS tells zlib to expect a gzip header).
    inflator = zlib.decompressobj(16 + zlib.MAX_WBITS)
    f = open(file_name, 'rb')
    while True:
        packet = f.read(chunk_size)
        if not packet: break
        to_do = inflator.unconsumed_tail + packet
        while to_do:
            decompressed = inflator.decompress(to_do, chunk_size)
            if not decompressed:
                to_do = None
                break
            yield decompressed
            to_do = inflator.unconsumed_tail
    leftovers = inflator.flush()
    if leftovers: yield leftovers
    f.close()

#for i in gunziplines(file_name): print i
def gunziplines(file_name, leftovers="", line_ending='\n'):
    # Reassemble the decompressed chunks into complete lines.
    for chunk in gunzipchunks(file_name):
        chunk = "".join([leftovers, chunk])
        while line_ending in chunk:
            line, leftovers = chunk.split(line_ending, 1)
            yield line
            chunk = leftovers
    if leftovers: yield leftovers

def gunziplinescounter(file_name):
    for counter, line in enumerate(gunziplines(file_name)):
        if (counter % 1000000 != 0): continue
        print "%12s: %10d" % ("checkpoint", counter)
    print "%12s: %10d" % ("final result", counter)
    print "DEBUG: last line: [%s]" % (line)

gunziplinescounter(file_name)
This should run a whole lot faster than using the builtin gzip module on extremely large files.
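To tie this back to the original question, here is a rough sketch (not the OP's exact code) of how gunziplines could stand in for the Perl loop. The match pattern is copied from the Perl snippet above; the substitution regex was elided in the question ("some regex here"), so it is left out here.

import re

# Rough sketch: filter lines the same way the Perl version does,
# using the gunziplines generator defined above.
success_re = re.compile(r'\[svc-\w+\].*login result: Successful\.$')

an_array = []
for line in gunziplines('big.txt.gz'):
    if success_re.search(line):
        an_array.append(line)

print len(an_array), "matching lines"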