Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to reuse a process pool for parallel programming in Python 3

I'm new to parrallel programming. My task is to analyze hundreds of data files. Each of those data is nearly 300MB, and could be sliced into numerous slices. My computer is a 4-core pc. And I want to get the result of each data as soon as possible.
The analysis of each data file consists of 2 procedures. First, read data into memory, and then slice it into slices, which is io intensive work. Then, do lots of computation for the slices of this file, which is cpu intensive.
So my strategy is group this files in group of 4. For each group of these files, first, read all data of 4 files into memory with 4 processes in 4 cores. The code is like,

with Pool(processes=4) as pool:
    data_list = pool.map(read_and_slice, files)  # len(files)==4

Then for each data in data_list, do computation work with 4 processes.

for data in data_list:  # I want to get the result of each data asap
    with Pool(processes=4) as pool:
        result_list = pool.map(compute, data.slices)  # anaylyze each slice of data
    analyze(result_list)  # analyze the results of previous procedure, for example, get the average.

And then go for another group.
So the problem is that during the whole process of computation of hundreds of files, the pool is recreated many times. How could I avoid the overhead of recreating pools and processes? Is there any substantial memory overhead in my code? And is there a better way for me to make the time needed as less as possible?

Thanks!

like image 935
ProtossShuttle Avatar asked Jan 01 '16 03:01

ProtossShuttle


People also ask

What is the difference between pool and process in python?

While the Process keeps all the processes in the memory, the Pool keeps only those that are under execution. Therefore, if you have a large number of tasks, and if they have more data and take a lot of space too, then using process class might waste a lot of memory. The overhead of creating a Pool is more.

How does Python handle multiprocessing?

In this example, at first we import the Process class then initiate Process object with the display() function. Then process is started with start() method and then complete the process with the join() method. We can also pass arguments to the function using args keyword.

What is Python multiprocessing pool?

The Pool class in multiprocessing can handle an enormous number of processes. It allows you to run multiple jobs per process (due to its ability to queue the jobs). The memory is allocated only to the executing processes, unlike the Process class, which allocates memory to all the processes.

Is multiprocessing part of Python?

The Python Multiprocessing Module is a tool for you to increase your scripts' efficiency by allocating tasks to different processes.


1 Answers

One option is to move the with Pool statement outside of the for loop…

p = Pool()
for data in data_list:
  result_list = pool.map(compute, data.slices)
  analyze(result_list)
p.join()
p.close()

This works in python 2 or 3.

If you install (my module) pathos, and then do from pathos.pools import ProcessPool as Pool, and keep the rest of the code exactly as you have it -- you will only create one Pool. This is because pathos caches the Pool, and when a new Pool instance is created that has the same configuration, it just reuses the existing instance. You can do a pool.terminate() to close it.

>>> from pathos.pools import ProcessPool as Pool
>>> pool = Pool()
>>> data_list = [range(4), range(4,8), range(8,12), range(12,16)]
>>> squared = lambda x:x**2
>>> mean = lambda x: sum(x)/len(x)
>>> for data in data_list:
...   result = pool.map(squared, data)
...   print mean(result)
... 
3
31
91
183

Actually, pathos enables you to do nested pools, so you could also convert your for loop into a asynchronous map (amap from pathos)… and since the inner map doesn't need to preserve order you could use a unordered map iterator (imap_unordered in multiprocessing, or uimap from pathos). For examples, see here: https://stackoverflow.com/questions/28203774/how-to-do-hierarchical-parallelism-in-ipython-parallel and here: https://stackoverflow.com/a/31617653/2379433

Only bummer is pathos is python2. But will soon (pending release) will be fully converted to python3.

like image 112
Mike McKerns Avatar answered Oct 01 '22 20:10

Mike McKerns