multiprocess or multithread? - parallelizing a simple computation for millions of iterations and storing the result in a single data structure

I have a dictionary D of {string: list} entries, and I compute a function f(D[s1], D[s2]) -> float for a pair of strings (s1, s2) in D.

Additionally, I have created a custom matrix class LabeledNumericMatrix that allows me to perform assignments such as m[ID1, ID2] = 1.0.

I need to calculate f(x,y) and store the result in m[x,y] for all 2-tuples in the set of strings S, including when s1 = s2. This is easy to code as a loop, but it takes quite some time as the size of S grows to 10,000 strings or more.
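The serial version is essentially the following loop (a rough sketch; S here is just the set of keys of D):

    import itertools

    # Serial baseline: visit every unordered pair, including s1 == s2.
    S = sorted(D.keys())
    for s1, s2 in itertools.combinations_with_replacement(S, 2):
        m[s1, s2] = f(D[s1], D[s2])

For |S| = 10,000 that is just over 50 million calls to f.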

None of the results I store in my labeled matrix m depend on each other. Therefore, it seems straightforward to parallelize this computation using Python's threading or multiprocessing services. However, since CPython's global interpreter lock prevents threads from truly executing the calculation of f(x,y) and the storage of m[x,y] in parallel, it seems that multiprocessing is my only choice. But I don't think multiprocessing is designed to pass 1 GB data structures between processes, such as my labeled matrix containing 10,000 x 10,000 elements.

Can anyone advise (a) whether I should avoid trying to parallelize this algorithm, and (b) if parallelization is feasible, how best to do it, preferably in CPython?

asked Feb 20 '12 by J.B. Brown


2 Answers

First option - a Server Process

Create a server process. It is part of the multiprocessing package and allows shared data structures to be accessed from multiple processes. Every process accesses the shared structure through a proxy, and the server process serializes the operations so that concurrent updates cannot conflict.

From the documentation:

Server process

A manager object returned by Manager() controls a server process which holds Python objects and allows other processes to manipulate them using proxies.

A manager returned by Manager() will support types list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Queue, Value and Array.
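A minimal sketch of this approach, using a managed dict keyed by (s1, s2) in place of your custom matrix class (f and D stand in for your function and data, and are assumed to be module-level so that forked workers inherit them):

    from itertools import combinations_with_replacement
    from multiprocessing import Manager, Pool

    def compute(args):
        s1, s2, shared = args
        # Each assignment is forwarded to the server process via the proxy.
        shared[(s1, s2)] = f(D[s1], D[s2])

    if __name__ == '__main__':
        manager = Manager()
        shared = manager.dict()  # lives in the server process
        pairs = combinations_with_replacement(sorted(D), 2)
        with Pool() as pool:
            pool.map(compute, ((s1, s2, shared) for s1, s2 in pairs))
        results = dict(shared)  # pull a local copy back when done

Bear in mind that every proxied assignment is a round-trip to the server process, so this is the most convenient option but not the fastest one.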

Second option - Pool of workers

Create a Pool of workers, an input Queue and a result Queue.

  • The main process, acting as a producer, will feed the input queue with pairs (s1, s2).
  • Each worker process will read a pair from the input Queue, and write the result into the output Queue.
  • The main process will read the results from the result Queue, and write them into the result matrix m (a sketch of this pattern follows the list).
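
A sketch of that producer/consumer layout (again assuming f and D are module-level and inherited by the forked workers; None serves as a shutdown sentinel):

    from itertools import combinations_with_replacement
    from multiprocessing import Process, Queue, cpu_count

    def worker(in_q, out_q):
        # Consume pairs until the sentinel arrives.
        for s1, s2 in iter(in_q.get, None):
            out_q.put((s1, s2, f(D[s1], D[s2])))

    if __name__ == '__main__':
        in_q, out_q = Queue(), Queue()
        workers = [Process(target=worker, args=(in_q, out_q))
                   for _ in range(cpu_count())]
        for w in workers:
            w.start()

        pairs = list(combinations_with_replacement(sorted(D), 2))
        for pair in pairs:
            in_q.put(pair)
        for _ in workers:
            in_q.put(None)  # one sentinel per worker

        # Only the main process writes to m, so no locking is needed.
        for _ in pairs:
            s1, s2, value = out_q.get()
            m[s1, s2] = value

        for w in workers:
            w.join()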

Third option - divide into independent problems

Your data is independent: f(D[si], D[sj]) is an isolated problem, independent of any f(D[sk], D[sl]). Furthermore, the computation time for each pair should be fairly equal, or at least of the same order of magnitude.

Divide the task into n input sets, where n is the number of computation units (cores, or even computers) you have. Give each input set to a different process, and join the output.
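A sketch under the same assumptions (fork semantics, so every worker inherits a read-only copy of D; the round-robin chunking here is just one way to balance the load):

    from itertools import combinations_with_replacement
    from multiprocessing import Pool, cpu_count

    def compute_chunk(pairs):
        # Runs in a worker process; returns plain tuples, not the matrix.
        return [(s1, s2, f(D[s1], D[s2])) for s1, s2 in pairs]

    if __name__ == '__main__':
        n = cpu_count()
        all_pairs = list(combinations_with_replacement(sorted(D), 2))
        chunks = [all_pairs[i::n] for i in range(n)]  # round-robin split

        with Pool(n) as pool:
            # Join the outputs: only small result tuples cross process
            # boundaries, never the 1 GB matrix itself.
            for chunk in pool.map(compute_chunk, chunks):
                for s1, s2, value in chunk:
                    m[s1, s2] = value

If collecting whole chunks at once uses too much memory, pool.imap_unordered lets you drain results incrementally instead.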

answered by Adam Matan


You definitely won't get any performance increase with threading - it is an inappropriate tool for CPU-bound tasks.

So the only viable choice is multiprocessing, but since you have a big data structure, I'd suggest something like mmap (pretty low-level, but built in) or Redis (a tasty, high-level API, but it has to be installed and configured).
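To illustrate the mmap route, here is a minimal sketch: back the matrix with a memory-mapped file of doubles that every worker process opens, after mapping each string ID to an integer index (index lookup omitted). Since each (i, j) cell is written exactly once by exactly one process, no locking is needed:

    import mmap
    import os
    import struct

    DOUBLE = struct.Struct('d')  # one 8-byte float per matrix cell

    def open_matrix(path, n):
        # Size the backing file for an n x n matrix and map it;
        # every worker process maps the same file.
        fd = os.open(path, os.O_RDWR | os.O_CREAT)
        os.ftruncate(fd, n * n * DOUBLE.size)
        return mmap.mmap(fd, n * n * DOUBLE.size)

    def put(mm, n, i, j, value):
        # Distinct cells have distinct byte offsets, so concurrent
        # writers never touch the same bytes.
        DOUBLE.pack_into(mm, (i * n + j) * DOUBLE.size, value)

    def get(mm, n, i, j):
        return DOUBLE.unpack_from(mm, (i * n + j) * DOUBLE.size)[0]

For 10,000 x 10,000 doubles the backing file is 800 MB, in line with the ~1 GB figure in the question.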

answered by Roman Bodnarchuk