Python multiprocessing: big data puts worker processes to sleep

I'm using Python 2.7.10. I read lots of files, store them in a big list, and then call multiprocessing, passing the big list to the worker processes so that each process can access it and do some calculation.

I'm using Pool like this:

import itertools
import multiprocessing

def read_match_wrapper(args):
    # unpack the shared-data tuple and append this task's index
    args2 = args[0] + (args[1],)
    read_match(*args2)

pool = multiprocessing.Pool(processes=10)
# note: 'chr' here shadows the built-in chr()
result = pool.map(read_match_wrapper,
                  itertools.izip(itertools.repeat((ped_list, chr_map, combined_id_to_id, chr)),
                                 range(10)))
pool.close()
pool.join()

Basically, I'm passing multiple variables to the 'read_match' function. In order to use pool.map, I wrote the 'read_match_wrapper' function. I don't need any results back from those processes. I just want them to run and finish.

I can get this whole thing to work when my data list 'ped_list' is quite small. When I load all the data, around 10 GB, all the processes the pool spawns show state 'S' (sleeping) and don't seem to be doing any work at all.

Is there a limit on how much data you can pass through a Pool? I really need help with this! Thanks!

asked Jul 02 '15 by odeya


People also ask

How do you stop a multiprocessing process in Python?

A process can be killed by calling the Process.terminate() method on the multiprocessing.Process instance. The call terminates only the target process, not its child processes.
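
For illustration, a minimal sketch (the slow_task function is a made-up example):

import time
import multiprocessing

def slow_task():
    while True:
        time.sleep(1)  # simulate a long-running job

if __name__ == '__main__':
    p = multiprocessing.Process(target=slow_task)
    p.start()
    time.sleep(2)
    p.terminate()  # kills only p, not any processes p may have started
    p.join()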

Can you multiprocess in Python?

The Python multiprocessing Pool can be used for parallel execution of a function across multiple input values, distributing the input data across processes (data parallelism).
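
A minimal data-parallelism sketch (the square function is a made-up example):

import multiprocessing

def square(x):
    return x * x

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    results = pool.map(square, range(10))  # inputs are distributed across 4 workers
    pool.close()
    pool.join()
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]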

How do you close a multiprocessing pool?

The process pool can be shut down by calling Pool.close(). This prevents the pool from accepting new tasks. Once all issued tasks are completed, the resources of the process pool, such as the child worker processes, will be released.
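
The usual shutdown sequence looks like this (work is a hypothetical task function):

import multiprocessing

def work(x):
    return x + 1

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map_async(work, range(100))  # issue tasks without blocking
    pool.close()  # stop accepting new tasks
    pool.join()   # wait for all issued tasks to finish and workers to exit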

How many processes should be running Python multiprocessing?

You can configure the number of processes via the processes argument, and this works the same way if you create the pool with a context manager so that it is shut down automatically. On Windows, the number of workers must be less than or equal to 61.
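
A Python 3.3+ sketch of setting the worker count while using a context manager (work is a hypothetical task function):

import multiprocessing

def work(x):
    return x * 2

if __name__ == '__main__':
    with multiprocessing.Pool(processes=8) as pool:  # must be <= 61 on Windows
        results = pool.map(work, range(100))
    # leaving the with-block calls pool.terminate(); call close() and join()
    # inside the block if you want a graceful shutdown instead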


1 Answer

From the multiprocessing Programming guidelines:

Avoid shared state

As far as possible one should try to avoid shifting large amounts of data between processes.

What you are suffering from is a typical symptom of a full Pipe that does not get drained.

The Python multiprocessing.Pipe used by the Pool has a design flaw: it implements a message-oriented protocol on top of an OS pipe, which behaves more like a stream object.

The result is that if you send too large an object through the Pipe, it gets clogged. The sender cannot push the rest of the message into it, and the receiver cannot drain it, as it is blocked waiting for the end of the message.

The proof is that your workers are sleeping, waiting for that "fat" message which never fully arrives.

Does ped_list contain the file names or the file contents?

If it's the latter, you should send the file names instead of the contents. The workers can retrieve the contents themselves with a simple open().
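
A sketch of that approach, reusing the question's variable names (the file names and the lookup tables are placeholders, and the per-line work is left as a stub):

import multiprocessing

chr_map = {}             # placeholder for a small shared lookup table
combined_id_to_id = {}   # placeholder for a small shared lookup table

def read_match(filename, chr_map, combined_id_to_id):
    # each worker opens its own file instead of receiving the
    # contents through the Pool's internal pipe
    with open(filename) as f:
        for line in f:
            pass  # per-line calculation goes here

def worker(args):
    read_match(*args)

if __name__ == '__main__':
    filenames = ['chr1.ped', 'chr2.ped']  # hypothetical input files
    tasks = [(name, chr_map, combined_id_to_id) for name in filenames]
    pool = multiprocessing.Pool(processes=10)
    pool.map(worker, tasks)
    pool.close()
    pool.join()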

answered Sep 20 '22 by noxdafox