Python vs Perl: performance reading a gzipped file

I have a gzipped data file containing a million lines:

$ zcat million_lines.txt.gz | head
1
2
3
4
5
6
7
8
9
10
...

My Perl script which processes this file is as follows:

# read_million.pl
use strict; 

my $file = "million_lines.txt.gz" ;

open MILLION, "gzip -cdfq $file |";

while ( <MILLION> ) {
    chomp $_; 
    if ($_ eq "1000000" ) {
        print "This is the millionth line: Perl\n"; 
        last; 
    }
}

In Python:

# read_million.py
import gzip

filename = 'million_lines.txt.gz'

fh = gzip.open(filename)

for line in fh:
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break

For whatever reason, the Python script takes roughly 8x longer:

$ time perl read_million.pl ; time python read_million.py
This is the millionth line: Perl

real    0m0.329s
user    0m0.165s
sys     0m0.019s
This is the millionth line: Python

real    0m2.663s
user    0m2.154s
sys     0m0.074s

I tried profiling both scripts, but there really isn't much code to profile. The Python script spends most of its time on for line in fh; the Perl script spends most of its time in if($_ eq "1000000").

Now, I know that Perl and Python have some expected differences. For instance, in Perl, I open the filehandle via a subprocess running the UNIX gzip command; in Python, I use the gzip library.

What can I do to speed up the Python implementation of this script (even if I never reach the Perl performance)? Perhaps the gzip module in Python is slow (or perhaps I'm using it in a bad way); is there a better solution?

EDIT #1

Here's what the read_million.py line-by-line profiling looks like.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           @profile
     3                                           def main():
     4
     5         1            1      1.0      0.0         filename = 'million_lines.txt.gz'
     6         1          472    472.0      0.0         fh = gzip.open(filename)
     7   1000000      5507042      5.5     84.3         for line in fh:
     8   1000000       582653      0.6      8.9                 line = line.strip()
     9   1000000       443565      0.4      6.8                 if line == '1000000':
    10         1           25     25.0      0.0                         print "This is the millionth line: Python"
    11         1            0      0.0      0.0                         break

EDIT #2:

I have now also tried the subprocess Python module, as suggested by @Kirk Strauser and others. It is faster:

Python "subproc" solution:

# read_million_subproc.py 
import subprocess

filename = 'million_lines.txt.gz'
gzip = subprocess.Popen(['gzip', '-cdfq', filename], stdout=subprocess.PIPE)
for line in gzip.stdout: 
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break
gzip.wait()

Here is a comparative table of all the things I've tried so far:

method                    average_running_time (s)
--------------------------------------------------
read_million.py           2.708
read_million_subproc.py   0.850
read_million.pl           0.393

1 Answer

Having tested a number of possibilities, it looks like the big culprits here are:

  1. Comparing apples to oranges: in your original test case, Perl wasn't doing the file I/O or decompression work itself; the gzip program was doing it (and gzip is written in C, so it runs pretty fast). In that version of the code, you're comparing parallel computation to serial computation.
  2. Interpreter startup time: on the vast majority of systems, Python takes substantially longer to begin running (I believe because more files are loaded at startup). On my machine, interpreter startup is about half the total wall clock time, 30% of the user time, and most of the system time; the actual work done in Python is swamped by it, so your benchmark is as much a comparison of startup times as of the time needed to do the work. (Later addition: you can reduce Python's startup overhead a bit further by invoking python with the -E switch, which disables checking of PYTHON* environment variables at startup, and the -S switch, which disables the automatic import site and thereby avoids a lot of dynamic sys.path setup/manipulation involving disk I/O, at the cost of cutting off access to any non-builtin libraries.)
  3. Python's subprocess module is a bit higher level than Perl's open call, and is implemented in Python (on top of lower level primitives). The generalized subprocess code takes longer to load (exacerbating startup time issues) and adds overhead to the process launch itself.
  4. Python 2's subprocess defaults to unbuffered I/O, so you're performing more system calls unless you pass an explicit bufsize argument (4096 to 8192 seems to work fine).
  5. The line.strip() call involves more overhead than you might think; function and method calls are more expensive in Python than they really should be, and line.strip() does not mutate the str in place the way Perl's chomp does (because Python's str is immutable, while Perl strings are mutable); see the sketch just after this list.
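
As a rough illustration of point 5, here's a minimal timeit sketch. It isn't part of the original benchmark, and the '1000000\n' test line is just an assumption chosen to mirror the input format:

# strip_vs_compare.py -- hypothetical micro-benchmark, not one of the original scripts
import timeit

# A line as it would come off the file iterator, trailing newline included
setup = "line = '1000000\\n'"

# What read_million.py does: strip each line, then compare
with_strip = timeit.timeit("line.strip() == '1000000'", setup=setup, number=1000000)

# Compare the raw line, newline included: no method call, no new string object
without_strip = timeit.timeit("line == '1000000\\n'", setup=setup, number=1000000)

print("with strip():    %.3fs per million comparisons" % with_strip)
print("without strip(): %.3fs per million comparisons" % without_strip)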

Here are a couple of versions of the code that bypass most of these problems. First, an optimized subprocess version:

#!/usr/bin/env python

import subprocess

# Launch with subprocess in list mode (no shell involved) and
# use a meaningful buffer size to minimize system calls
proc = subprocess.Popen(['gzip', '-cdfq', 'million_lines.txt.gz'], stdout=subprocess.PIPE, bufsize=4096)
# Iterate stdout directly
for line in proc.stdout:
    if line == '1000000\n':  # Avoid stripping; under Python 3 the pipe yields bytes, so compare against b'1000000\n'
        print("This is the millionth line: Python")
        break
# Prevent deadlocks by terminating, not waiting, child process
proc.terminate()

Second, pure Python code relying mostly on built-in (C-level) APIs, which eliminates most of the extraneous startup overhead and shows that Python's gzip module performs much the same as the external gzip program. It's ridiculously micro-optimized at the expense of readability/maintainability/brevity/portability:

#!/usr/bin/env python

import os

rpipe, wpipe = os.pipe()

def reader():
    import gzip
    FILE = "million_lines.txt.gz"
    os.close(rpipe)
    with gzip.open(FILE) as inf, os.fdopen(wpipe, 'wb') as outf:
        buf = bytearray(16384)  # Reusable buffer to minimize allocator overhead
        while 1:
            cnt = inf.readinto(buf)
            if not cnt: break
            outf.write(buf[:cnt] if cnt != 16384 else buf)

pid = os.fork()
if not pid:
    try:
        reader()
    finally:
        os._exit(0)  # _exit requires an explicit status

try:
    os.close(wpipe)
    with os.fdopen(rpipe, 'rb') as f:
        for line in f:
            if line == b'1000000\n':
                print("This is the millionth line: Python")
                break
finally:
    os.kill(pid, 9)

On my local system, on the best of half a dozen runs, the subprocess code takes:

0.173s/0.157s/0.031s wall/user/sys time.

The primitives based Python code with no external utility programs gets that down to a best time of:

0.147s/0.103s/0.013s

(though that was an outlier; a good wall clock time was usually more like 0.165s). Adding -E -S to the invocation shaves another 0.01-0.015s off wall clock and user time by removing the overhead of setting up the import machinery to handle non-builtins. In other comments, you mention that your Python takes nearly 0.6 seconds to launch doing absolutely nothing (but otherwise seems to perform similarly to mine), which may indicate you've got quite a bit more in the way of non-default packages or environment customization going on, and -E -S may save you more.

The Perl code, unmodified from what you gave me (aside from using the three-argument open to avoid string parsing, and storing the pid returned from open so it can be explicitly killed before exiting), had a best time of:

0.183s/0.216s/0.005s

Regardless, we're talking about trivial differences (the timing jitter from run to run was around 0.025s for wall clock and user time, so Python's wins on wall clock time were mostly insignificant, though it did save on user time meaningfully). Python can win, as can Perl, but non-language related concerns are more important.
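
One more single-process option, offered only as a hedged sketch and not benchmarked above: on Python 2.7, GzipFile's per-line iteration is implemented in Python, and wrapping the file object in io.BufferedReader pushes readline() into C-level buffered code, which is commonly reported to make line loops over gzipped files noticeably faster (the script name below is just an assumption following the question's naming):

# read_million_buffered.py -- unverified single-process sketch
import gzip
import io

filename = 'million_lines.txt.gz'

# Wrap the GzipFile so iteration/readline() happens in io's buffered C code
fh = io.BufferedReader(gzip.open(filename, 'rb'))
for line in fh:
    # Lines come back as bytes; compare with the newline attached to avoid strip()
    if line == b'1000000\n':
        print("This is the millionth line: Python")
        break
fh.close()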
