 

Sharing data between processes in Python

I have a complex data structure (a user-defined type) on which a large number of independent calculations are performed. The data structure is basically immutable. I say basically because, although the interface looks immutable, internally some lazy evaluation is going on. Some of the lazily calculated attributes are stored in dictionaries (return values of costly functions, keyed by input parameter). I would like to use Python's multiprocessing module to parallelize these calculations. There are two questions on my mind.

  1. How do I best share the data-structure between processes?
  2. Is there a way to handle the lazy-evaluation problem without using locks (multiple processes write the same value)?
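
For concreteness, here is a simplified sketch of the kind of structure I mean (all names are made up for illustration):

def expensive_function( raw_data, parameter ):
    # Stand-in for a costly calculation.
    return sum( raw_data ) * parameter

class Structure:
    def __init__( self, raw_data ):
        self._raw_data = raw_data
        self._cache = {}    # lazily filled: parameter -> costly result

    def costly_value( self, parameter ):
        # Looks read-only from the outside, but the first access for a
        # given parameter writes into the internal cache.
        if parameter not in self._cache:
            self._cache[ parameter ] = expensive_function( self._raw_data, parameter )
        return self._cache[ parameter ]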

Thanks in advance for any answers, comments or enlightening questions!

Björn Pollex, asked Aug 10 '10 at 10:08




1 Answer

How do I best share the data-structure between processes?

Pipelines.

origin.py | process1.py | process2.py | process3.py

Break your program up so that each calculation is a separate process of the following form.

def transform1( piece ):
    # Some transformation or calculation on one piece.
    ...

For testing, you can use it like this.

def t1( iterable ):
    for piece in iterable:
        more_data = transform1( piece )
        # NewNamedTuple is a placeholder for whatever record type carries
        # the original piece plus the newly computed data downstream.
        yield NewNamedTuple( piece, more_data )

To reproduce the whole calculation in a single process, you can do this.

for x in t1( t2( t3( the_whole_structure ) ) ):
    print( x )

You can wrap each transformation with a little bit of file I/O. Pickle works well for this, but other representations (like JSON or YAML) work well, too.

import pickle, sys

while True:
    a_piece = pickle.load( sys.stdin.buffer )
    more_data = transform1( a_piece )
    pickle.dump( NewNamedTuple( a_piece, more_data ), sys.stdout.buffer )

Each processing step becomes an independent OS-level process. The steps run concurrently and will immediately consume all available OS-level resources.
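
Fleshed out as a standalone stage, one worker might look like the sketch below. The loop above never stops on its own, so this version also breaks out when the upstream process closes its end of the pipe; transform1 and the record layout are placeholders carried over from the snippets above.

#!/usr/bin/env python
# process1.py -- one pipeline stage: read pickled pieces from stdin,
# run the transformation, write enriched records to stdout.
import pickle
import sys

def transform1( piece ):
    # Placeholder for the real calculation on one piece.
    return piece

if __name__ == "__main__":
    while True:
        try:
            a_piece = pickle.load( sys.stdin.buffer )
        except EOFError:
            break    # upstream stage closed its end of the pipe
        more_data = transform1( a_piece )
        pickle.dump( ( a_piece, more_data ), sys.stdout.buffer )
    sys.stdout.buffer.flush()

The stages are then wired together exactly as in the shell pipeline at the top of this answer.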

Is there a way to handle the lazy-evaluation problem without using locks (multiple processes write the same value)?

Pipelines.
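
Because each stage is a separate process with its own copy of the data, a lazily computed value only ever lives in the process that computed it, so there is nothing to lock; anything a later stage needs is passed along explicitly in the record. Inside one stage, an ordinary per-process cache is enough. The sketch below uses functools.lru_cache (Python 3.2+) as one possible way to memoize; it is not something from the question.

import functools

@functools.lru_cache( maxsize=None )
def costly( parameter ):
    # Stand-in for the expensive lazily evaluated attribute. The cache is
    # private to this process, so no other process can race on it.
    return parameter ** 2

def transform2( piece ):
    # Enrich the piece and pass both values downstream instead of sharing state.
    return piece, costly( piece )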

S.Lott, answered Sep 17 '22 at 14:09