 

Always run a constant number of subprocesses in parallel

I want to use subprocesses to run 20 instances of a script in parallel. Let's say I have a big list of URLs with around 100,000 entries, and my program should ensure that 20 instances of my script are working on that list at all times. I wanted to code it as follows:

import subprocess

urllist = [url1, url2, url3, ..., url100000]
i = 0
while number_of_subprocesses < 20 and i < 100000:
    subprocess.Popen(['python', 'script.py', urllist[i]])
    i = i + 1

My script just writes something into a database or a text file. It doesn't output anything and doesn't need any input other than the URL.

My problem is that I wasn't able to find out how to get the number of subprocesses that are currently active. I'm a novice programmer, so every hint and suggestion is welcome. I was also wondering how, once the 20 subprocesses are running, I can make the while loop check the conditions again. I thought of maybe putting another while loop around it, something like:

while i < 100000:
    while number_of_subprocesses < 20:
        subprocess.Popen(['python', 'script.py', urllist[i]])
        i = i + 1
        if number_of_subprocesses == 20:
            sleep()  # wait some time before checking again

Or maybe there is a better way to have the while loop constantly check the number of subprocesses?
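To illustrate what I mean, here is a rough sketch of that checking loop, keeping the Popen handles in a list and using poll() to see which children are still running (the urllist here is just a placeholder, and script.py is my script from above):

    import subprocess
    import time

    urllist = ['http://example.com/%d' % n for n in range(100)]  # placeholder list
    max_processes = 20
    running = []  # Popen handles of subprocesses that are still alive
    i = 0

    while i < len(urllist) or running:
        # poll() returns None while a child is still running, so keep only active handles
        running = [p for p in running if p.poll() is None]
        # start new subprocesses until 20 are running or the list is exhausted
        while len(running) < max_processes and i < len(urllist):
            running.append(subprocess.Popen(['python', 'script.py', urllist[i]]))
            i += 1
        time.sleep(1)  # wait some time before checking again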

I also considered using the multiprocessing module, but I find it really convenient to just call script.py with subprocess instead of a function with multiprocessing.

Maybe someone can help me and point me in the right direction. Thanks a lot!

asked Aug 08 '13 by zwieback86

2 Answers

Just keep a count as you start them, and use a callback from each subprocess to start a new one whenever there are URL list entries left to process.

For example, assuming that your subprocess calls the OnExit method passed to it as it ends:

import subprocess
import time

NextURLNo = 0
MaxProcesses = 20
NoSubProcess = 0   # number of subprocesses currently running
MaxUrls = 100000

def StartNew():
    """ Start a new subprocess if there is work to do """
    global NextURLNo
    global NoSubProcess

    if NextURLNo < MaxUrls:
        # relies on the assumption above that script.py invokes OnExit when it finishes
        subprocess.Popen(['python', 'script.py', urllist[NextURLNo], OnExit])
        print "Started to Process", urllist[NextURLNo]
        NextURLNo += 1
        NoSubProcess += 1

def OnExit():
    """ Called by a subprocess as it ends """
    global NoSubProcess
    NoSubProcess -= 1

if __name__ == "__main__":
    # fill up to MaxProcesses subprocesses, then keep topping up until the url list is done
    for n in range(MaxProcesses):
        StartNew()
    while NoSubProcess > 0:
        time.sleep(1)
        if NextURLNo < MaxUrls:
            for n in range(NoSubProcess, MaxProcesses):
                StartNew()
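Note that Popen by itself has no exit callback, so one way to actually get OnExit invoked is to wait for each child from a small watcher thread; this is only a sketch of that idea (run_one is a hypothetical helper, not part of the answer above), and since the callbacks then run in different threads, the NoSubProcess update in OnExit should be protected with a lock:

    import subprocess
    import threading

    def run_one(url, on_exit):
        """Start script.py for one url and invoke on_exit once the child has finished."""
        def watcher():
            subprocess.Popen(['python', 'script.py', url]).wait()
            on_exit()
        threading.Thread(target=watcher).start()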
answered by Steve Barnes


To keep a constant number of concurrent requests, you could use a thread pool:

#!/usr/bin/env python
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing.Pool API

def process_url(url):
    # ... handle a single url
    pass

urllist = [url1, url2, url3, ..., url100000]
for _ in Pool(20).imap_unordered(process_url, urllist):
    pass

To run processes instead of threads, remove .dummy from the import.
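As a rough sketch of that process-based variant, wired up to the question's script.py (the urllist here is only a placeholder); the __main__ guard matters because worker processes may re-import the module on some platforms:

    #!/usr/bin/env python
    import subprocess
    from multiprocessing import Pool  # real worker processes instead of threads

    def process_url(url):
        # hand one url to the question's script.py (assumed to exist next to this file)
        subprocess.call(['python', 'script.py', url])

    if __name__ == '__main__':
        urllist = ['http://example.com/page%d' % n for n in range(100)]  # placeholder
        for _ in Pool(20).imap_unordered(process_url, urllist):
            pass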

answered by jfs