I have a dictionary D of {string: list} entries, and I compute a function f(D[s1], D[s2]) -> float for a pair of strings (s1, s2) in D.
Additionally, I have created a custom matrix class LabeledNumericMatrix that allows me to perform assignments such as m[ID1, ID2] = 1.0.
I need to calculate f(x,y) and store the result in m[x,y] for all 2-tuples in the set of strings S, including when s1=s2. This is easy to code as a loop, but execution of this code takes quite some time as the size of the set S grows to large values such as 10,000 or more.
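For reference, the serial version looks roughly like this (f and D are placeholder definitions here, a plain dict stands in for LabeledNumericMatrix, and S is the set of keys):

```python
from itertools import combinations_with_replacement

def f(x, y):
    # placeholder for the real pairwise computation
    return float(len(x) + len(y))

D = {"a": [1, 2], "b": [3], "c": [4, 5, 6]}
S = sorted(D)

m = {}  # stands in for the LabeledNumericMatrix
for s1, s2 in combinations_with_replacement(S, 2):  # includes s1 == s2
    m[s1, s2] = f(D[s1], D[s2])
```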
None of the results I store in my labeled matrix m depend on each other. Therefore, it seems straightforward to parallelize this computation using Python's threading or multiprocessing modules. However, since CPython doesn't truly allow me to execute the calculation of f(x,y) and the storage of m[x,y] simultaneously through threading, multiprocessing seems to be my only choice. However, I don't think multiprocessing is designed to pass ~1 GB data structures between processes, such as my labeled matrix containing 10000x10000 elements.
Can anyone advise (a) whether I should avoid trying to parallelize this algorithm, and (b) if parallelization is feasible, how to do it, preferably in CPython?
Create a server process. It's part of the multiprocessing package and allows parallel access to data structures: every process manipulates the shared structure through a proxy, and the manager process serializes access so workers don't clobber each other.
From the documentation:
Server process
A manager object returned by Manager() controls a server process which holds Python objects and allows other processes to manipulate them using proxies.
A manager returned by Manager() will support types list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Queue, Value and Array.
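As an illustration, here is a minimal sketch of that approach. A Manager-hosted dict stands in for the LabeledNumericMatrix, and f and D are placeholder definitions, since the real ones aren't shown. Keep in mind that every write is forwarded to the manager's server process, which adds per-item overhead.

```python
from itertools import combinations_with_replacement
from multiprocessing import Manager, Pool

def f(x, y):
    # placeholder for the real pairwise computation
    return float(len(x) + len(y))

D = {"a": [1, 2], "b": [3], "c": [4, 5, 6]}

def worker(args):
    # the dict proxy is picklable, so it can be passed to pool workers;
    # each assignment goes through the manager's server process
    s1, s2, shared = args
    shared[s1, s2] = f(D[s1], D[s2])

if __name__ == "__main__":
    with Manager() as manager:
        results = manager.dict()
        pairs = combinations_with_replacement(sorted(D), 2)
        with Pool() as pool:
            pool.map(worker, [(s1, s2, results) for s1, s2 in pairs])
        m = dict(results)  # copy back into an ordinary (or labeled) matrix
    print(m)
```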
Create a Pool of workers, an input Queue and a result Queue.
Your data is independent: f(D[si], D[sj]) is a self-contained problem, independent of any f(D[sk], D[sl]). Furthermore, the computation time for each pair should be fairly equal, or at least of the same order of magnitude.
Divide the task into n input sets, where n is the number of computation units (cores, or even computers) you have. Give each input set to a different process, and join the output.
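A minimal sketch of this chunking approach, again with placeholder f and D and an ordinary dict in place of the labeled matrix. Each worker returns plain (s1, s2, value) triples, so the large matrix only ever exists in the parent process:

```python
from itertools import combinations_with_replacement
from multiprocessing import Pool, cpu_count

def f(x, y):
    # placeholder for the real pairwise computation
    return float(len(x) + len(y))

D = {"a": [1, 2], "b": [3], "c": [4, 5, 6]}

def compute_chunk(pairs):
    # workers return small triples instead of sharing the big structure
    return [(s1, s2, f(D[s1], D[s2])) for s1, s2 in pairs]

if __name__ == "__main__":
    pairs = list(combinations_with_replacement(sorted(D), 2))
    n = cpu_count()
    chunks = [pairs[i::n] for i in range(n)]  # n roughly equal input sets
    with Pool(n) as pool:
        results = pool.map(compute_chunk, chunks)
    # join the output: fill the matrix (here a dict) in the parent only
    m = {}
    for chunk in results:
        for s1, s2, value in chunk:
            m[s1, s2] = value
    print(len(m))
```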
You definitely won't get any performance increase with threading - it is an inappropriate tool for CPU-bound tasks. So the only possible choice is multiprocessing, but since you have a big data structure, I'd suggest something like mmap (pretty low level, but builtin) or Redis (tasty and high-level API, but has to be installed and configured).
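Here is a minimal sketch of the mmap route. It assumes the labels can be mapped to integer indices so the matrix becomes a plain float64 array backed by a file, and it uses numpy.memmap (an assumption, not mentioned above) rather than the raw mmap module for convenience; f and D are placeholders. Because each worker writes to a disjoint set of cells, no locking is needed.

```python
from itertools import combinations_with_replacement
from multiprocessing import Pool
import numpy as np

def f(x, y):
    # placeholder for the real pairwise computation
    return float(len(x) + len(y))

D = {"a": [1, 2], "b": [3], "c": [4, 5, 6]}
labels = sorted(D)
index = {s: i for i, s in enumerate(labels)}
PATH = "matrix.dat"  # hypothetical backing file for the shared matrix

def worker(pairs):
    # every process maps the same file; writes land in the shared mapping
    m = np.memmap(PATH, dtype="float64", mode="r+",
                  shape=(len(labels), len(labels)))
    for s1, s2 in pairs:
        m[index[s1], index[s2]] = f(D[s1], D[s2])
    m.flush()

if __name__ == "__main__":
    n = len(labels)
    # create and zero the backing file once in the parent
    np.memmap(PATH, dtype="float64", mode="w+", shape=(n, n)).flush()
    pairs = list(combinations_with_replacement(labels, 2))
    chunks = [pairs[i::4] for i in range(4)]
    with Pool(4) as pool:
        pool.map(worker, chunks)
    m = np.memmap(PATH, dtype="float64", mode="r", shape=(n, n))
    print(m[index["a"], index["c"]])
```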