[Edit: Read the accepted answer first. The long investigation below stems from a subtle blunder in the timing measurement.]
I often need to process extremely large (100GB+) text/CSV-like files containing highly redundant data that cannot practically be stored on disk uncompressed. I rely heavily on external compressors like lz4 and zstd, which produce stdout streams approaching 1GB/s.
As such, I care a lot about the performance of Unix shell pipelines. But large shell scripts are difficult to maintain, so I tend to construct pipelines in Python, stitching commands together with careful use of shlex.quote(). This process is tedious and error-prone, so I'd like a "Pythonic" way to achieve the same end, managing the stdin/stdout file descriptors in Python without offloading to /bin/sh. However, I've never found a method of doing this without greatly sacrificing performance.
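For concreteness, here is a simplified sketch of the kind of quoted shell pipeline I build today (the file name, pattern, and exact commands are purely illustrative):
from shlex import quote
from subprocess import run

# Illustrative only: decompress a large zstd archive and filter it via /bin/sh.
commands = [
    ["zstd", "-dc", "huge_dump.zst"],   # decompress to stdout
    ["grep", "interesting pattern"],    # keep matching lines
    ["wc", "-l"],                       # count them
]
pipeline = " | ".join(" ".join(quote(arg) for arg in cmd) for cmd in commands)
run(pipeline, shell=True)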
Python 3's documentation recommends replacing shell pipelines with the communicate() method on subprocess.Popen. I've adapted this example to create the following test script, which pipes 3GB of /dev/zero into a useless grep, which outputs nothing:
#!/usr/bin/env python3
from shlex import quote
from subprocess import Popen, PIPE
from time import perf_counter
BYTE_COUNT = 3_000_000_000
UNQUOTED_HEAD_CMD = ["head", "-c", str(BYTE_COUNT), "/dev/zero"]
UNQUOTED_GREP_CMD = ["grep", "Arbitrary string which will not be found."]
QUOTED_SHELL_PIPELINE = " | ".join(
    " ".join(quote(s) for s in cmd)
    for cmd in [UNQUOTED_HEAD_CMD, UNQUOTED_GREP_CMD]
)
perf_counter()
proc = Popen(QUOTED_SHELL_PIPELINE, shell=True)
proc.wait()
print(f"Time to run using shell pipeline: {perf_counter()} seconds")
perf_counter()
p1 = Popen(UNQUOTED_HEAD_CMD, stdout=PIPE)
p2 = Popen(UNQUOTED_GREP_CMD, stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()
p2.communicate()
print(f"Time to run using subprocess.PIPE: {perf_counter()} seconds")
Output:
Time to run using shell pipeline: 2.412427189 seconds
Time to run using subprocess.PIPE: 4.862174164 seconds
The subprocess.PIPE approach is more than twice as slow as /bin/sh. If we raise the input size to 90GB (BYTE_COUNT = 90_000_000_000), we confirm this is not a constant-time overhead:
Time to run using shell pipeline: 88.796322932 seconds
Time to run using subprocess.PIPE: 183.734968687 seconds
My assumption up to now was that subprocess.PIPE is simply a high-level abstraction for connecting file descriptors, and that data is never copied into the Python process itself. As expected, when running the above test, head uses 100% CPU but subproc_test.py uses near-zero CPU and RAM.
Given that, why is my pipeline so slow? Is this an intrinsic limitation of Python's subprocess? If so, what does /bin/sh do differently under the hood that makes it twice as fast?
More generally, are there better methods for building large, high-performance subprocess pipelines in Python?
You're timing it wrong. Your perf_counter() calls don't start and stop a timer; each call just returns the number of seconds elapsed since some arbitrary reference point. That reference point probably happens to coincide with the first perf_counter() call here, but it could be any point, even one in the future.
The actual time taken by the subprocess.PIPE method is 4.862174164 - 2.412427189 = 2.449746975 seconds, not 4.862174164 seconds. This timing shows no measurable performance penalty from subprocess.PIPE.
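For reference, here is a minimal sketch of the corrected measurement: record the counter before the run and subtract it afterwards (same commands as in the question; apply the identical pattern to the shell-pipeline variant):
from subprocess import Popen, PIPE
from time import perf_counter

start = perf_counter()
p1 = Popen(["head", "-c", "3000000000", "/dev/zero"], stdout=PIPE)
p2 = Popen(["grep", "Arbitrary string which will not be found."],
           stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()  # allow head to receive SIGPIPE if grep exits first
p2.communicate()
print(f"Time to run using subprocess.PIPE: {perf_counter() - start:.3f} seconds")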