I have a gzipped data file containing a million lines:
$ zcat million_lines.txt.gz | head
1
2
3
4
5
6
7
8
9
10
...
My Perl script which processes this file is as follows:
# read_million.pl
use strict;

my $file = "million_lines.txt.gz";
open MILLION, "gzip -cdfq $file |";
while ( <MILLION> ) {
    chomp $_;
    if ( $_ eq "1000000" ) {
        print "This is the millionth line: Perl\n";
        last;
    }
}
In Python:
# read_million.py
import gzip

filename = 'million_lines.txt.gz'
fh = gzip.open(filename)
for line in fh:
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break
For whatever reason, the Python script takes roughly 8x longer:
$ time perl read_million.pl ; time python read_million.py
This is the millionth line: Perl
real 0m0.329s
user 0m0.165s
sys 0m0.019s
This is the millionth line: Python
real 0m2.663s
user 0m2.154s
sys 0m0.074s
I tried profiling both scripts, but there really isn't much code to profile. The Python script spends most of its time on for line in fh; the Perl script spends most of its time in if ($_ eq "1000000").
Now, I know that Perl and Python have some expected differences. For instance, in Perl I open the filehandle via a subprocess running the UNIX gzip command; in Python, I use the gzip library.
What can I do to speed up the Python implementation of this script (even if I never reach the Perl performance)? Perhaps the gzip module in Python is slow (or perhaps I'm using it in a bad way); is there a better solution?
Here's what the line-by-line profiling of read_million.py looks like.
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           @profile
     3                                           def main():
     4
     5         1            1      1.0      0.0      filename = 'million_lines.txt.gz'
     6         1          472    472.0      0.0      fh = gzip.open(filename)
     7   1000000      5507042      5.5     84.3      for line in fh:
     8   1000000       582653      0.6      8.9          line = line.strip()
     9   1000000       443565      0.4      6.8          if line == '1000000':
    10         1           25     25.0      0.0              print "This is the millionth line: Python"
    11         1            0      0.0      0.0              break
EDIT #2:
I have now also tried the Python subprocess module, as suggested by @Kirk Strauser and others. It is faster:
Python "subproc" solution:
# read_million_subproc.py
import subprocess

filename = 'million_lines.txt.gz'
gzip = subprocess.Popen(['gzip', '-cdfq', filename], stdout=subprocess.PIPE)
for line in gzip.stdout:
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break
gzip.wait()
Here is a comparative table of all the things I've tried so far:
method                      average_running_time (s)
----------------------------------------------------
read_million.py             2.708
read_million_subproc.py     0.850
read_million.pl             0.393
Having tested a number of possibilities, it looks like the big culprits here are:

1. In the original comparison, Perl wasn't doing the decompression work itself; the external gzip program was doing so (and it's written in C, so it runs pretty fast); in that version of the code, you're comparing parallel computation to serial computation.

2. Python's interpreter startup overhead, which can be reduced by launching python with the -E switch (to disable checking of PYTHON* environment variables at startup) and the -S switch (to disable the automatic import of the site module, which avoids a lot of dynamic sys.path setup/manipulation involving disk I/O, at the expense of cutting off access to any non-builtin libraries).

3. Python's subprocess module is a bit higher level than Perl's open call, and is implemented in Python (on top of lower-level primitives). The generalized subprocess code takes longer to load (exacerbating startup time issues) and adds overhead to the process launch itself.

4. subprocess defaults to unbuffered I/O, so you're performing more system calls unless you pass an explicit bufsize argument (4096 to 8192 seems to work fine).

5. The line.strip() call involves more overhead than you might think; function and method calls are more expensive in Python than they really should be, and line.strip() does not mutate the str in place the way Perl's chomp does (because Python's str is immutable, while Perl strings are mutable). A rough micro-benchmark of this difference follows the list.
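To put a rough number on that last point, here is a small micro-benchmark sketch of my own (not taken from the timings above); it uses timeit to compare stripping each line against comparing the raw line with its newline still attached:

#!/usr/bin/env python
# Micro-benchmark sketch: per-line strip() vs. comparing the raw line.
import timeit

# An arbitrary non-matching line, chosen purely for illustration.
setup = r"line = '999999\n'"

# Strip every line, then compare against the bare target.
print(timeit.timeit(r"line.strip() == '1000000'", setup=setup, number=1000000))

# Compare directly against the target with its newline (no method call).
print(timeit.timeit(r"line == '1000000\n'", setup=setup, number=1000000))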
A couple of versions of the code below bypass most of these problems. First, an optimized subprocess approach:
#!/usr/bin/env python
import subprocess

# Launch with subprocess in list mode (no shell involved) and
# use a meaningful buffer size to minimize system calls
proc = subprocess.Popen(['gzip', '-cdfq', 'million_lines.txt.gz'],
                        stdout=subprocess.PIPE, bufsize=4096)

# Iterate stdout directly
for line in proc.stdout:
    if line == '1000000\n':  # Avoid stripping
        print("This is the millionth line: Python")
        break

# Prevent deadlocks by terminating, not waiting for, the child process
proc.terminate()
Second, pure Python, mostly built-in (C-level) API based code (which eliminates most extraneous startup overhead, and shows that Python's gzip module is not meaningfully slower than the gzip program), ridiculously micro-optimized at the expense of readability/maintainability/brevity/portability:
#!/usr/bin/env python
import os

rpipe, wpipe = os.pipe()

def reader():
    import gzip
    FILE = "million_lines.txt.gz"
    os.close(rpipe)
    with gzip.open(FILE) as inf, os.fdopen(wpipe, 'wb') as outf:
        buf = bytearray(16384)  # Reusable buffer to minimize allocator overhead
        while 1:
            cnt = inf.readinto(buf)
            if not cnt: break
            outf.write(buf[:cnt] if cnt != 16384 else buf)

pid = os.fork()
if not pid:
    try:
        reader()
    finally:
        os._exit(0)  # Exit the child without running cleanup handlers

try:
    os.close(wpipe)
    with os.fdopen(rpipe, 'rb') as f:
        for line in f:
            if line == b'1000000\n':
                print("This is the millionth line: Python")
                break
finally:
    os.kill(pid, 9)
On my local system, on the best of half a dozen runs, the subprocess code takes:
0.173s/0.157s/0.031s wall/user/sys time.
The primitives-based Python code with no external utility programs gets that down to a best time of:
0.147s/0.103s/0.013s
(though that was an outlier; a good wall clock time was usually more like 0.165s). Adding -E -S to the invocation shaves another 0.01-0.015s of wall clock and user time by removing the overhead of setting up the import machinery to handle non-builtins; in other comments, you mention that your Python takes nearly 0.6 seconds to launch while doing absolutely nothing (but otherwise seems to perform similarly to mine), which may indicate you've got quite a bit more in the way of non-default packages or environment customization going on, and -E -S may save you more.
The Perl code, largely unmodified from what you gave me (aside from using the three-argument form of open to avoid string parsing, and storing the pid returned from open so the child could be explicitly killed before exiting), had a best time of:
0.183s/0.216s/0.005s
Regardless, we're talking about trivial differences (the timing jitter from run to run was around 0.025s for wall clock and user time, so Python's wins on wall clock time were mostly insignificant, though it did save on user time meaningfully). Python can win, as can Perl, but non-language related concerns are more important.
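For completeness, one more pure-Python variant worth trying (my own sketch, not benchmarked above): wrap the GzipFile in an io.BufferedReader so that line iteration pulls from a large in-memory buffer instead of going through the GzipFile's readline path for every line. The 64 KB buffer size here is an arbitrary choice.

#!/usr/bin/env python
import gzip
import io

# Wrap the decompressed stream in a large read buffer; line iteration then
# happens against the buffer rather than the GzipFile itself.
with io.BufferedReader(gzip.open('million_lines.txt.gz', 'rb'), buffer_size=65536) as fh:
    for line in fh:
        if line == b'1000000\n':  # compare the raw bytes; no per-line strip()
            print("This is the millionth line: Python")
            break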