Python multiprocessing: big data puts worker processes to sleep

I'm using Python 2.7.10. I read lots of files, store them in a big list, and then call multiprocessing, passing the big list to the worker processes so that each process can access it and do some calculation.

I'm using Pool like this:

import itertools
import multiprocessing

def read_match_wrapper(args):
    # unpack the shared-data tuple and append this task's index
    args2 = args[0] + (args[1],)
    read_match(*args2)

pool = multiprocessing.Pool(processes=10)
# note: 'chr' here shadows the built-in chr()
result = pool.map(read_match_wrapper,
                  itertools.izip(itertools.repeat((ped_list, chr_map, combined_id_to_id, chr)),
                                 range(10)))
pool.close()
pool.join()

Basically, I'm passing multiple variables to the 'read_match' function. In order to use pool.map, I wrote the 'read_match_wrapper' function. I don't need any results back from those processes. I just want them to run and finish.

I can get this whole thing to work when my data list 'ped_list' is quite small. When I load all the data, around 10 GB, all the processes the pool spawns show state 'S' (sleeping) and don't seem to be doing any work at all.

Is there a limit on how much data you can pass through a Pool? I really need help with this! Thanks!

asked Jul 02 '15 by odeya


People also ask

How do you stop a multiprocessing process in Python?

A process can be killed by calling the Process.terminate() method on the multiprocessing.Process instance. The call terminates only the target process, not its child processes.
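
For illustration, a minimal sketch (the slow_task function is a made-up example):

import time
import multiprocessing

def slow_task():
    while True:
        time.sleep(1)  # simulate a long-running job

if __name__ == '__main__':
    p = multiprocessing.Process(target=slow_task)
    p.start()
    time.sleep(2)
    p.terminate()  # kills only p, not any processes p may have started
    p.join()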

Can you multiprocess in Python?

The Python multiprocessing Pool can be used for parallel execution of a function across multiple input values, distributing the input data across processes (data parallelism).
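
A minimal data-parallelism sketch (the square function is a made-up example):

import multiprocessing

def square(x):
    return x * x

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    results = pool.map(square, range(10))  # inputs are distributed across 4 workers
    pool.close()
    pool.join()
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]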

How do you close a multiprocessing pool?

The process pool can be shut down by calling Pool.close(). This prevents the pool from accepting new tasks. Once all issued tasks are completed, the resources of the process pool, such as the child worker processes, will be released.
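
The usual shutdown sequence looks like this (work is a hypothetical task function):

import multiprocessing

def work(x):
    return x + 1

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map_async(work, range(100))  # issue tasks without blocking
    pool.close()  # stop accepting new tasks
    pool.join()   # wait for all issued tasks to finish and workers to exit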

How many processes should be running Python multiprocessing?

You can configure the number of processes via the processes argument, and this works the same way if you create the pool with a context manager so that it is shut down automatically. On Windows, the number of workers must be less than or equal to 61.
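
A Python 3.3+ sketch of setting the worker count while using a context manager (work is a hypothetical task function):

import multiprocessing

def work(x):
    return x * 2

if __name__ == '__main__':
    with multiprocessing.Pool(processes=8) as pool:  # must be <= 61 on Windows
        results = pool.map(work, range(100))
    # leaving the with-block calls pool.terminate(); call close() and join()
    # inside the block if you want a graceful shutdown instead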


1 Answer

From the multiprocessing Programming guidelines:

Avoid shared state

As far as possible one should try to avoid shifting large amounts of data between processes.

What you are suffering from is a typical symptom of a full Pipe that does not get drained.

The Python multiprocessing.Pipe used by the Pool has a design flaw: it implements a message-oriented protocol on top of an OS pipe, which behaves more like a stream object.

The result is that if you send too large an object through the Pipe, it gets clogged. The sender cannot push the rest of the message into it, and the receiver cannot drain it, as it is blocked waiting for the end of the message.

The proof is that your workers are sleeping, waiting for that "fat" message which never fully arrives.

Does ped_list contain the file names or the file contents?

If it's the latter, you should send the file names instead of the contents. The workers can retrieve the contents themselves with a simple open().
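
A sketch of that approach, reusing the question's variable names (the file names and the lookup tables are placeholders, and the per-line work is left as a stub):

import multiprocessing

chr_map = {}             # placeholder for a small shared lookup table
combined_id_to_id = {}   # placeholder for a small shared lookup table

def read_match(filename, chr_map, combined_id_to_id):
    # each worker opens its own file instead of receiving the
    # contents through the Pool's internal pipe
    with open(filename) as f:
        for line in f:
            pass  # per-line calculation goes here

def worker(args):
    read_match(*args)

if __name__ == '__main__':
    filenames = ['chr1.ped', 'chr2.ped']  # hypothetical input files
    tasks = [(name, chr_map, combined_id_to_id) for name in filenames]
    pool = multiprocessing.Pool(processes=10)
    pool.map(worker, tasks)
    pool.close()
    pool.join()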

answered Sep 20 '22 by noxdafox