Given a large list (1,000+ items) of completely independent objects that each need to be processed by some expensive function (~5 minutes each), what is the best way to distribute the work across other cores? Theoretically, I could just cut the list into equal parts, serialize the data with cPickle (takes a few seconds), and launch a new Python process for each chunk--and it may come to that if I intend to use multiple computers--but this feels like more of a hack than anything. Surely there is a more integrated way to do this using a multiprocessing library? Am I over-thinking this?
Thanks.
This sounds like a good use case for a multiprocessing.Pool; depending on exactly what you're doing, it could be as simple as
import multiprocessing

pool = multiprocessing.Pool(num_procs)  # num_procs: how many worker processes to use
results = pool.map(the_function, list_of_objects)
pool.close()
pool.join()
This will pickle each object in the list independently. If that's a problem, there are various ways around it (each with its own drawbacks, and I don't know whether any of them work on Windows). Since your computation times are fairly long, the pickling overhead is probably irrelevant anyway.
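For example, one such workaround (on platforms that fork, so not Windows) is to keep the big list in a module-level global and pass only indices to the workers, so the objects are inherited by fork rather than pickled per task. This is just a sketch; build_list_of_objects, the_function and num_procs are placeholder names:

import multiprocessing

big_list = build_list_of_objects()  # hypothetical loader; must run before the Pool is created

def work_on_index(i):
    # each forked worker reads its inherited copy of big_list
    return the_function(big_list[i])

if __name__ == "__main__":
    pool = multiprocessing.Pool(num_procs)
    results = pool.map(work_on_index, range(len(big_list)))
    pool.close()
    pool.join()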
Since you're looking at roughly 5 minutes x 1,000 items = about 83 hours of CPU time divided across your cores, you probably want to save partial results along the way and print some progress information. The easiest approach is probably to have the function you call save its own results to a file or database or whatever; if that's not practical, you could instead use apply_async in a loop and handle the results as they come in (see the sketch below).
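Here's a rough sketch of the apply_async approach, writing each result to its own file as it arrives so partial progress survives a crash. The wrapper, callback and file naming are placeholders, not a fixed recipe:

import multiprocessing
import pickle

def wrapped(index, obj):
    # return the index along with the result so the callback knows which item finished
    return index, the_function(obj)

def handle_result(args):
    # the callback runs in the parent process as each task completes
    index, result = args
    with open("result_%05d.pkl" % index, "wb") as f:
        pickle.dump(result, f)
    print("finished item %d of %d" % (index, len(list_of_objects)))

if __name__ == "__main__":
    pool = multiprocessing.Pool(num_procs)
    for i, obj in enumerate(list_of_objects):
        pool.apply_async(wrapped, (i, obj), callback=handle_result)
    pool.close()
    pool.join()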
You could also look into something like joblib to handle this for you; I'm not very familiar with it, but it seems to address the same problem.
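If you go that route, the call would look roughly like this (Parallel and delayed are joblib's public API; the_function, list_of_objects and num_procs are the same placeholders as above, and verbose=10 prints periodic progress):

from joblib import Parallel, delayed

results = Parallel(n_jobs=num_procs, verbose=10)(
    delayed(the_function)(obj) for obj in list_of_objects
)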