I'm new to parallel programming. My task is to analyze hundreds of data files. Each file is nearly 300 MB and can be sliced into numerous slices. My computer is a 4-core PC, and I want to get the result for each file as soon as possible.
The analysis of each data file consists of two procedures. First, read the data into memory and slice it into slices, which is IO-intensive work. Then, do lots of computation on the slices of that file, which is CPU-intensive.
So my strategy is to group the files in groups of 4. For each group, first read all the data of the 4 files into memory with 4 processes on 4 cores. The code is like,
from multiprocessing import Pool

with Pool(processes=4) as pool:
    data_list = pool.map(read_and_slice, files)  # len(files) == 4
Then, for each data in data_list, do the computation work with 4 processes.
for data in data_list:  # I want to get the result of each data asap
    with Pool(processes=4) as pool:
        result_list = pool.map(compute, data.slices)  # analyze each slice of data
    analyze(result_list)  # analyze the results of the previous procedure, e.g. get the average
And then go for another group.
So the problem is that during the whole computation over hundreds of files, the pool is recreated many times. How can I avoid the overhead of recreating the pools and processes? Is there any substantial memory overhead in my code? And is there a better way to make the time needed as short as possible?
Thanks!
While the Process class keeps all of the processes you create alive in memory at once, a Pool keeps only a fixed set of worker processes and hands them tasks as they become free. So if you have a large number of tasks, and they carry a lot of data, using the Process class directly can waste a lot of memory. The Pool class, by contrast, can handle an enormous number of jobs, because it queues them and runs multiple jobs per worker process; memory is allocated only to the executing workers. The trade-off is that creating a Pool costs more than creating a single Process, which is exactly why you want to create it once and reuse it. Either way, the multiprocessing module is the tool for increasing a script's efficiency by distributing tasks across processes.
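As a minimal sketch of that difference (the square task here is hypothetical, just a stand-in for your compute):
from multiprocessing import Process, Pool

def square(x):  # a toy stand-in for a CPU-bound task
    return x * x

if __name__ == '__main__':
    # Process: one OS process per task, all created by hand; the return
    # value is also discarded unless you add a queue or pipe yourself.
    procs = [Process(target=square, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    # Pool: 4 long-lived workers pull tasks from a queue, so memory stays
    # bounded by the pool size, and results are collected for you.
    pool = Pool(processes=4)
    results = pool.map(square, range(1000))  # 1000 tasks, still only 4 processes
    pool.close()
    pool.join()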
One option is to move the creation of the Pool outside of the for loop…
pool = Pool(processes=4)
for data in data_list:
    result_list = pool.map(compute, data.slices)
    analyze(result_list)
pool.close()  # no more tasks will be submitted
pool.join()   # wait for the workers to exit
This works in Python 2 or 3.
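On Python 3 you can get the same single-pool effect with a context manager, which cleans up for you; a sketch, reusing compute and analyze from the question:
from multiprocessing import Pool

with Pool(processes=4) as pool:  # one pool for the whole run
    for data in data_list:
        result_list = pool.map(compute, data.slices)
        analyze(result_list)
# the pool is terminated automatically when the with block exits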
If you install (my module) pathos, and then do from pathos.pools import ProcessPool as Pool, and keep the rest of the code exactly as you have it -- you will only create one Pool. This is because pathos caches the Pool, and when a new Pool instance is created with the same configuration, it just reuses the existing instance. You can do a pool.terminate() to close it.
>>> from pathos.pools import ProcessPool as Pool
>>> pool = Pool()
>>> data_list = [range(4), range(4,8), range(8,12), range(12,16)]
>>> squared = lambda x:x**2
>>> mean = lambda x: sum(x)/len(x)
>>> for data in data_list:
... result = pool.map(squared, data)
... print mean(result)
...
3
31
91
183
Actually, pathos enables you to do nested pools, so you could also convert your for loop into an asynchronous map (amap from pathos)… and since the inner map doesn't need to preserve order, you could use an unordered map iterator (imap_unordered in multiprocessing, or uimap from pathos). For examples, see here: https://stackoverflow.com/questions/28203774/how-to-do-hierarchical-parallelism-in-ipython-parallel and here: https://stackoverflow.com/a/31617653/2379433
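A sketch of what that nesting could look like, assuming nested pools behave as described above (handle_one_file is a hypothetical helper; compute and analyze are the question's functions):
from pathos.pools import ProcessPool

def handle_one_file(data):
    # inner pool: an average doesn't care about ordering, so take the
    # per-slice results in completion order with uimap
    inner = ProcessPool(4)
    result_list = list(inner.uimap(compute, data.slices))
    return analyze(result_list)

outer = ProcessPool(4)
# amap returns a handle immediately; get() blocks until every file is done
summaries = outer.amap(handle_one_file, data_list).get()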
The only bummer is that pathos is Python 2. But it will soon (pending release) be fully converted to Python 3.