How to join the stdout of two subprocesses and pipe to stdin of new subprocess in python

Let's say I have the following command running from the shell:

{ 
samtools view -HS header.sam;           # command1
samtools view input.bam 1:1-50000000;   # command2
} | samtools view -bS - > output.bam    # command3

For those of you who aren't familiar with samtools view (since this is Stack Overflow): what this essentially does is create a new BAM file with a new header. BAM files are typically large compressed files, so even a single pass over the file can be time-consuming. One alternative approach would be to run command2 and then use samtools reheader to swap the header, but that passes over the large file twice. The command above passes over the BAM only once, which matters for larger BAM files (they can be larger than 20GB even when compressed - WGS).

My question is how to implement commands of this type in Python using subprocess.

I have the following:

import subprocess

fh_bam = open('output.bam', 'w')
params_0 = [ "samtools", "view", "-HS", "header.sam" ]
params_1 = [ "samtools", "view", "input.bam", "1:1-50000000"]
params_2 = [ "samtools", "view", "-bS", "-" ]
sub_0 = subprocess.Popen(params_0, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
sub_1 = subprocess.Popen(params_1, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
### SOMEHOW APPEND sub_1.stdout to sub_0.stdout
sub_2 = subprocess.Popen(params_2, stdin=appended.stdout, stdout=fh_bam)

Any help is greatly appreciated. Thanks.

asked Feb 27 '15 by Marco Albuquerque

2 Answers

If you already have the shell command as a string, you can just run it as-is:

#!/usr/bin/env python
from subprocess import check_call

check_call(r"""
{ 
samtools view -HS header.sam;           # command1
samtools view input.bam 1:1-50000000;   # command2
} | samtools view -bS - > output.bam    # command3
""", shell=True)

To emulate the pipeline in Python:

#!/usr/bin/env python
from subprocess import Popen, PIPE

# start command3 to get stdin pipe, redirect output to the file
with open('output.bam', 'wb', 0) as output_file:
    command3 = Popen("samtools view -bS -".split(), 
                     stdin=PIPE, stdout=output_file)
# start command1 with its stdout redirected to command3 stdin
command1 = Popen('samtools view -HS header.sam'.split(),
                 stdout=command3.stdin)
rc_command1 = command1.wait()
# NOTE: command3.stdin is not closed; no SIGPIPE or a write error if command3 dies
# start command2 after command1 finishes
command2 = Popen('samtools view input.bam 1:1-50000000'.split(),
                 stdout=command3.stdin)
command3.stdin.close() # inform command2 if command3 dies (SIGPIPE or a write error)
rc_command2 = command2.wait()
rc_command3 = command3.wait()
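The same fan-in wiring can be exercised without samtools; here small `python -c` one-liners stand in for the three commands (a sketch; the one-liners and the output file name are hypothetical stand-ins):

```python
import sys
from subprocess import Popen, PIPE

# Stand-in for command3: uppercase everything read from stdin,
# writing to a hypothetical output file.
with open('joined_output.txt', 'wb', 0) as output_file:
    command3 = Popen([sys.executable, '-c',
                      'import sys; sys.stdout.write(sys.stdin.read().upper())'],
                     stdin=PIPE, stdout=output_file)
# Stand-in for command1: emit the "header" into command3's stdin
command1 = Popen([sys.executable, '-c', 'print("header")'],
                 stdout=command3.stdin)
command1.wait()
# Stand-in for command2: emit the "body" into the same pipe
command2 = Popen([sys.executable, '-c', 'print("body")'],
                 stdout=command3.stdin)
command3.stdin.close()  # command3 sees EOF once command2 exits
command2.wait()
command3.wait()
```

Because both producers write to the same inherited pipe end, command3 receives the header followed by the body as one continuous stream, with no intermediate file.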
answered Oct 28 '22 by jfs

(Sadly I can't comment, so this 'answer' is really a comment on cmidi's answer; if somebody could move it, that would be appreciated! -- PS: that answer has since been deleted...)

Marco explicitly said that the commands produce a lot of output, around 20GB. If you use communicate(), it waits for the process to terminate and buffers everything read from the pipe in memory. In practice the OS will swap that data out to disk in the meantime, unless your computer has more than 20GB of free RAM. So you end up writing the intermediate data to disk, which is exactly what the original author wanted to avoid. +1 for sirlark's answer!
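To illustrate the contrast: wiring one process's stdout directly into the next one's stdin (as in jfs's answer) keeps the data flowing kernel-side, so the parent Python process never holds it. A minimal sketch with small stand-in commands:

```python
import sys
from subprocess import Popen, PIPE

# Producer's stdout feeds the consumer's stdin directly; the parent
# never reads the intermediate data into memory.
producer = Popen([sys.executable, '-c', 'print("x" * 10)'], stdout=PIPE)
consumer = Popen([sys.executable, '-c',
                  'import sys; print(len(sys.stdin.read()))'],
                 stdin=producer.stdout, stdout=PIPE)
producer.stdout.close()  # parent drops its copy so the consumer owns the pipe
out, _ = consumer.communicate()
```

Here communicate() is only used on the final, small output; the bulk data ("x" * 10 standing in for 20GB of BAM records) travels between the two children without ever passing through Python.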

answered Oct 28 '22 by Ariel