
Python: running subprocess in parallel [duplicate]

I have the following code that writes the md5sums to a logfile:

    for file in files_output:
        p = subprocess.Popen(['md5sum', file], stdout=logfile)
    p.wait()
  1. Will these run in parallel? i.e. if md5sum takes a long time for one of the files, will another one be started before waiting for a previous one to complete?

  2. If the answer to the above is yes, can I assume the order of the md5sums written to logfile may differ based upon how long md5sum takes for each file? (some files can be huge, some small)

imagineerThat asked May 08 '13 21:05




1 Answer

  1. Yes, these md5sum processes will be started in parallel.
  2. Yes, the order of the md5sum writes will be unpredictable. And generally, sharing a single resource, such as a file, between many processes this way is considered bad practice.

Also, since your p.wait() comes after the for loop, it will wait only for the last of the md5sum processes to finish, and the rest of them might still be running.
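To see the minimal change that addresses the waiting problem alone, here is a sketch (assuming `files_output` and an open `logfile` as in the question; the helper name `md5_all` is made up here) that keeps every Popen handle and waits on each:

```python
import subprocess

def md5_all(paths, logfile):
    # Start one md5sum per file; all of them run in parallel.
    procs = [subprocess.Popen(['md5sum', path], stdout=logfile)
             for path in paths]
    # Now block until *every* process has exited, not just the last one.
    for proc in procs:
        proc.wait()
```

The write order in logfile is still unpredictable, which the temp-file approach below fixes.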

But you can modify this code slightly to keep the benefits of parallel processing while getting predictable, synchronized output: collect each md5sum's output in a temporary file, then copy it all into one file once every process is done.

    import subprocess
    import tempfile

    processes = []
    for file in files_output:
        # os.tmpfile() was removed in Python 3; tempfile.TemporaryFile()
        # is the replacement (it is opened in binary mode, so logfile
        # should be opened in binary mode too)
        f = tempfile.TemporaryFile()
        p = subprocess.Popen(['md5sum', file], stdout=f)
        processes.append((p, f))

    for p, f in processes:
        p.wait()
        f.seek(0)
        logfile.write(f.read())
        f.close()
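On Python 3 the same pattern can be written more compactly with concurrent.futures, which also preserves input order for you. A sketch (the names `md5_line` and `parallel_md5` are made up here; assumes md5sum is on the PATH):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def md5_line(path):
    # subprocess.run blocks this worker thread, but the thread pool
    # keeps several md5sum child processes running at once.
    result = subprocess.run(['md5sum', path],
                            capture_output=True, text=True, check=True)
    return result.stdout

def parallel_md5(paths, max_workers=4):
    # pool.map yields results in input order, no matter which
    # md5sum process happened to finish first.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(md5_line, paths))
```

Then `logfile.writelines(parallel_md5(files_output))` writes one checksum line per file, in the original order.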
dkz answered Sep 23 '22 15:09