I have a handful of Python scripts, each of which makes heavy use of sorting, uniq-ing, counting, gzipping and gunzipping, and awking. As a first pass through the code I've used subprocess.call with shell=True (yes, I know the security risks; that's why I said it is a first pass). I have a little helper function:
from datetime import datetime
from subprocess import call
import sys

def do(command):
    start = datetime.now()
    return_code = call(command, shell=True)
    print 'Completed in %s, return code = %d' % (datetime.now() - start, return_code)
    if return_code != 0:
        print 'Failure: aborting with return code %d' % return_code
        sys.exit(return_code)
Scripts make use of this helper as in the following snippets:
do('gunzip -c %s | %s | sort -u | %s > %s' % (input, parse, flatten, output))
do("gunzip -c %s | grep 'en$' | cut -f1,2,4 -d\|| %s > %s" % (input, parse, output))
do('cat %s | %s | gzip -c > %s' % (input, dedupe, output))
do("awk -F ' ' '{print $%d,$%d}' %s | sort -u | %s | gzip -c > %s" % params)
do('gunzip -c %s | %s | gzip -c > %s' % (input, parse, output))
do('gunzip -c %s | %s > %s' % (input, parse, collection))
do('%s < %s >> %s' % (parse, supplement, collection))
do('cat %s %s | sort -k 2 | %s | gzip -c > %s' % (source, other_source, match, output))
And there are many more like these, some with even longer pipelines.
One issue I notice is that when a command early in a pipeline fails, the pipeline as a whole still exits with status 0. In bash I fix this with set -o pipefail, but I do not see how this can be done in Python. I suppose I could put in an explicit call to bash, but that seems wrong. Is it?
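For a concrete illustration of the problem, here is a minimal sketch: the shell reports only the last command's exit status, so the pipeline below reports success even though its first stage fails.

from subprocess import call

# `false` fails, but the pipeline's status is that of `cat`, so this prints 0:
print call('false | cat', shell=True)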
In lieu of an answer to that specific question, I'd love to hear alternatives for implementing this kind of code in pure Python without resorting to shell=True. But when I attempt to use Popen and stdout=PIPE, the code size blows up. There is something nice about writing pipelines on one line as a string, but if anyone knows an elegant, multiline, "proper and secure" way to do this in Python, I would love to hear it!
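For reference, here is roughly what just one of the snippets above turns into with Popen and stdout=PIPE. This is a minimal sketch that reuses the question's placeholder names (input, output) and drops the middle stages:

from subprocess import Popen, PIPE
import sys

# Roughly "gunzip -c input | sort -u > output", with per-stage status checks:
gunzip = Popen(['gunzip', '-c', input], stdout=PIPE)
with open(output, 'w') as out:
    sort = Popen(['sort', '-u'], stdin=gunzip.stdout, stdout=out)
gunzip.stdout.close()  # let gunzip receive SIGPIPE if sort exits early
sort_rc = sort.wait()
gunzip_rc = gunzip.wait()
if sort_rc != 0 or gunzip_rc != 0:
    sys.exit(sort_rc or gunzip_rc)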
An aside: none of these scripts ever takes user input; they run batch jobs on a machine with a known shell, which is why I ventured into the evil shell=True just to see how things would look. And they do look pretty easy to read, and the code seems so concise! How does one remove the shell=True and run these long pipelines in raw Python while still getting the advantage of aborting the process if an early component fails?
You can set pipefail in the shell you invoke from call:
def do(command):
    start = datetime.now()
    return_code = call(['/bin/bash', '-c', 'set -o pipefail; ' + command])
    ...
Or, as @RayToal pointed out in a comment, use the -o option of the shell to set this flag: call(['/bin/bash', '-o', 'pipefail', '-c', command]).
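If you do want to drop shell=True entirely, as the question asks, one option is a small helper that wires Popen objects together and checks every stage's return code, which emulates pipefail in pure Python. A sketch (do_pipe and its argument-list convention are made up for illustration, not a standard API):

from subprocess import Popen, PIPE
import sys

def do_pipe(*commands):
    # Run the given argument lists as a pipeline, e.g.
    # do_pipe(['gunzip', '-c', 'in.gz'], ['sort', '-u']) ~ "gunzip -c in.gz | sort -u"
    procs = []
    prev_stdout = None
    for i, cmd in enumerate(commands):
        stdout = PIPE if i < len(commands) - 1 else None
        proc = Popen(cmd, stdin=prev_stdout, stdout=stdout)
        if prev_stdout is not None:
            prev_stdout.close()  # let upstream stages receive SIGPIPE
        prev_stdout = proc.stdout
        procs.append(proc)
    for proc in procs:
        proc.wait()
    for proc in procs:  # pipefail-style: any non-zero stage aborts
        if proc.returncode != 0:
            print 'Failure: aborting with return code %d' % proc.returncode
            sys.exit(proc.returncode)

Each stage stays a single argument list, so there is no shell and no quoting to worry about; redirecting the final stage into a file would need one more parameter, but the pipefail-style behavior comes for free.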