Achieving shell-like pipeline performance in Python

[Edit: Read the accepted answer first; the long investigation below stems from a subtle blunder in the timing measurement.]

I often need to process extremely large (100GB+) text/CSV-like files containing highly redundant data that cannot practically be stored on disk uncompressed. I rely heavily on external compressors like lz4 and zstd, which can decompress to stdout at rates approaching 1GB/s.

As such, I care a lot about the performance of Unix shell pipelines. But large shell scripts are difficult to maintain, so I tend to construct pipelines in Python, stitching commands together with careful use of shlex.quote().

This process is tedious and error-prone, so I'd like a "Pythonic" way to achieve the same end, managing the stdin/stdout file descriptors in Python without offloading to /bin/sh. However, I've never found a method of doing this without greatly sacrificing performance.

Python 3's documentation recommends replacing shell pipelines with the communicate() method on subprocess.Popen. I've adapted the documentation's example into the following test script, which pipes 3GB of /dev/zero into a useless grep that outputs nothing:

#!/usr/bin/env python3
from shlex import quote
from subprocess import Popen, PIPE
from time import perf_counter

BYTE_COUNT = 3_000_000_000
UNQUOTED_HEAD_CMD = ["head", "-c", str(BYTE_COUNT), "/dev/zero"]
UNQUOTED_GREP_CMD = ["grep", "Arbitrary string which will not be found."]

QUOTED_SHELL_PIPELINE = " | ".join(
    " ".join(quote(s) for s in cmd)
    for cmd in [UNQUOTED_HEAD_CMD, UNQUOTED_GREP_CMD]
)

perf_counter()
proc = Popen(QUOTED_SHELL_PIPELINE, shell=True)
proc.wait()
print(f"Time to run using shell pipeline: {perf_counter()} seconds")

perf_counter()
p1 = Popen(UNQUOTED_HEAD_CMD, stdout=PIPE)
p2 = Popen(UNQUOTED_GREP_CMD, stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()
p2.communicate()
print(f"Time to run using subprocess.PIPE: {perf_counter()} seconds")

Output:

Time to run using shell pipeline: 2.412427189 seconds
Time to run using subprocess.PIPE: 4.862174164 seconds

The subprocess.PIPE approach is more than twice as slow as /bin/sh. If we raise the input size to 90GB (BYTE_COUNT = 90_000_000_000), we confirm this is not a constant-time overhead:

Time to run using shell pipeline: 88.796322932 seconds
Time to run using subprocess.PIPE: 183.734968687 seconds

My assumption up to now was that subprocess.PIPE is simply a high-level abstraction for connecting file descriptors, and that data is never copied into the Python process itself. As expected, when running the above test, head uses 100% CPU while subproc_test.py uses near-zero CPU and RAM.
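In other words, I assume subprocess.PIPE amounts to roughly the following hand-rolled wiring with os.pipe() (a rough, unbenchmarked sketch reusing the command lists from the script above):

import os
from subprocess import Popen

# Sketch of what I assume happens under the hood: create a kernel pipe and
# hand one end to each child, so bytes flow from head to grep entirely
# outside the Python process.
read_fd, write_fd = os.pipe()
p1 = Popen(UNQUOTED_HEAD_CMD, stdout=write_fd)
p2 = Popen(UNQUOTED_GREP_CMD, stdin=read_fd)
os.close(write_fd)  # the parent must drop its copies so grep sees EOF
os.close(read_fd)
p1.wait()
p2.wait()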

Given that, why is my pipeline so slow? Is this an intrinsic limitation of Python's subprocess? If so, what does /bin/sh do differently under the hood that makes it twice as fast?

More generally, are there better methods for building large, high-performance subprocess pipelines in Python?

asked Oct 18 '18 by goodside


1 Answer

You're timing it wrong. Your perf_counter() calls don't start and stop a timer; they just return a number of seconds since some arbitrary starting point. That starting point probably happens to be the first perf_counter() call here, but it could be any point, even one in the future.

The actual time taken by the subprocess.PIPE method is 4.862174164 - 2.412427189 = 2.449746975 seconds, not 4.862174164 seconds. That is roughly the same as the 2.412427189 seconds the shell pipeline took, so measured correctly, subprocess.PIPE shows no meaningful performance penalty here.
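For reference, a minimal sketch of the corrected measurement: take a fresh reading before each block and print the difference of the two readings (this reuses the names defined in the question's script):

from subprocess import Popen, PIPE
from time import perf_counter

# Time the shell pipeline: elapsed time is the difference of two readings.
start = perf_counter()
proc = Popen(QUOTED_SHELL_PIPELINE, shell=True)
proc.wait()
print(f"Time to run using shell pipeline: {perf_counter() - start} seconds")

# Time the subprocess.PIPE version with its own start reading.
start = perf_counter()
p1 = Popen(UNQUOTED_HEAD_CMD, stdout=PIPE)
p2 = Popen(UNQUOTED_GREP_CMD, stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()
p2.communicate()
print(f"Time to run using subprocess.PIPE: {perf_counter() - start} seconds")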

answered Nov 16 '22 by user2357112 supports Monica