
Parallel running of several jobs in a python script

I am not a programmer, so simple answers would be appreciated. I am an MD involved in a bioinformatics project.

Let's say I have a Python script, abc.py, and a text file, commandline.txt, containing 113 command lines, one per line, each to be passed to this script so that the jobs run in parallel. I want each job to run in its own directory, named scatter.001, scatter.002, ..., scatter.113 (just a unique number for each), created in the directory I execute the script from.

I am running Windows 7 with Python 2.7.

What is the command line for doing this? (python xyz\abc.py ....... )

PS:

-p 100 -m 10000000 -e 10 -k I:\Exome\Invex\analyses\PatientSet.load_maf.pkl ,UBE2Q1,RNF17,RNF10,REM1,PMM2,ZNF709,ZNF708,ZNF879,DISC1,RPL37,ZNF700,ZNF707,CAMK4,ZC3H10,ZC3H13,RNF115,ZC3H14,SPN,HMGCLL1,CEACAM5,GRIN1,DHX8,NUP98,XPC,SP4,SP5,CAMKV,SPPL3,RAB40C,RAB40A,COL7A1,GTSE1,OVCH1,FAM183B,KIAA0831,SPPL2B,ITGA8,ITGA9,MYO3B,ATP2A2,ITGA1,ITGA2,ITGA3,ITGA5,RIT1,ITGA7,TRHR,LOC100132288,DENND4A,DENND4B,TAP2,GAP43,PAMR1,HRH2,HRH3,HRH1,FBXL18,FAM169B,GHDC,SDK1,SDK2,THSD4,THSD1,ZFP161,CHST8,COL4A5,COL4A4,COL4A3,COL4A2,COL4A1,CHST1,CHST5,CHST4,ITGAX I:\Exome\Invex\analyses\First7.final_analysis_set.maf I:\Exome\Invex\temp\unzipped_power_files First7 I:\Exome\Invex\analyses\First7.individual_set.txt I:\Exome\Invex\hg19.fasta I:\Exome\Invex\hg19_encoded_by_trinucleotide.fasta I:\Exome\Invex\TCGA.hg19.June2011.gaf I:\Exome\Invex\hg19 I:\Exome\Invex\pph2_whpss_reduced I:\Exome\Invex\cosmic_num_times_each_chr_pos_mutated.tab

That is an example of one line in commandline.txt. I have 113 such lines in the file.

Shyam_LA asked Dec 04 '22 15:12


2 Answers

If you do this from the command line alone, you're getting into Windows shell programming, which nobody does. (I mean somebody does it, but they're an extremely small group.)

It would be simplest to write a second Python script that loops through the argument lines you want to pass to abc.py and launches abc.py once for each line.

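import sys
from subprocess import Popen
from os import mkdir

argfile = open('commandline.txt')
for number, line in enumerate(argfile, 1):  # start at 1 so the first directory is scatter.001
    newpath = 'scatter.%03i' % number
    mkdir(newpath)
    # Run abc.py with the same interpreter as this script. The path is relative
    # to the new directory, and uses a backslash because cmd.exe treats "/" as a switch.
    cmd = '"%s" ..\\abc.py %s' % (sys.executable, line.strip())
    print('Running %r in %r' % (cmd, newpath))
    Popen(cmd, shell=True, cwd=newpath)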

This creates a directory for each line and runs your command as a separate process in that directory. Since it doesn't wait for one subprocess to finish before starting the next, this gives the parallelism you want.
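Starting all 113 jobs at the same moment may use more memory or CPU than the machine has. Here is a minimal sketch of one way to cap how many run at once. The names `run_throttled`, `max_parallel` and `poll_seconds` are made up for this example, and it assumes each job is a pair `(args, cwd)` where `args` is whatever you would pass to `subprocess.Popen` as its first argument.

```python
import subprocess
import time


def run_throttled(jobs, max_parallel=4, poll_seconds=0.5):
    """Run (args, cwd) jobs, keeping at most max_parallel alive at once.

    Returns the exit codes in the same order as jobs.
    """
    pending = list(enumerate(jobs))
    running = {}                      # job index -> Popen object
    exit_codes = [None] * len(jobs)
    while pending or running:
        # Start new jobs while there is a free slot.
        while pending and len(running) < max_parallel:
            index, (args, cwd) = pending.pop(0)
            running[index] = subprocess.Popen(args, cwd=cwd)
        # Collect any jobs that have finished and free their slots.
        for index, proc in list(running.items()):
            code = proc.poll()
            if code is not None:
                exit_codes[index] = code
                del running[index]
        # Pause briefly before checking again, to avoid a busy loop.
        if running:
            time.sleep(poll_seconds)
    return exit_codes
```

With the setup from the question, each job's `cwd` would be one of the scatter directories, and a `max_parallel` close to the number of CPU cores is probably a reasonable starting point.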


The in-series version just waits for each subprocess to finish before it starts the next. Replace the last line of the loop with these two:

    p = Popen(cmd, shell=True, cwd=newpath)
    p.wait()
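A middle ground is to start everything in parallel but still wait at the end, so you know when all the jobs are done and whether any failed. This is only a sketch: `launch_all` and `wait_all` are hypothetical helper names, and `commands` is assumed to be a list of things `subprocess.Popen` accepts as its first argument.

```python
import subprocess


def launch_all(commands):
    """Start every command without waiting; return the Popen objects."""
    return [subprocess.Popen(args) for args in commands]


def wait_all(procs):
    """Block until every process has exited; return their exit codes in order."""
    return [proc.wait() for proc in procs]
```

An exit code of 0 conventionally means success, so any other value in the returned list points to a job worth checking.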
bukzor answered Dec 07 '22 03:12


This Python script should do it in parallel:

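import os, subprocess
n = 1  # start at 1 so the first directory is scatter.001
for cmd in open('commandline.txt'):
    newpath = 'scatter.%03d' % n
    os.mkdir(newpath)
    # strip() drops the trailing newline read from the file
    subprocess.Popen("..\\abc.py " + cmd.strip(), shell=True, cwd=newpath)
    n += 1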

Note that this assumes abc.py and commandline.txt are both in the directory you run this script from. If that is not the case, replace the relative path with a full one, something like "C:\\path\\to\\abc.py".
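One way to avoid depending on where abc.py sits relative to each scatter directory is to turn its path into an absolute one before launching anything. A small sketch, assuming a helper named `build_command` (made up for this example) and that the same Python interpreter should run abc.py:

```python
import os
import sys


def build_command(script_path, arg_line):
    """Return a command string that runs script_path with the arguments in arg_line."""
    # abspath() resolves the path against the current directory once, up front,
    # so the command still works when it is run with cwd set to another directory.
    script = os.path.abspath(script_path)
    return '"%s" "%s" %s' % (sys.executable, script, arg_line.strip())
```

The result could then be passed to `subprocess.Popen(cmd, shell=True, cwd=newpath)` in place of the hand-built string.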

Luke answered Dec 07 '22 05:12