I am trying to run parallel processes in Python (on Ubuntu).
I started using multiprocessing and it worked fine for simple examples.
Then came the pickle error, so I switched to pathos. I got a little confused by the different options, so I wrote a very simple benchmark:
import multiprocessing as mp
from pathos.multiprocessing import Pool as Pool1
from pathos.pools import ParallelPool as Pool2
from pathos.parallel import ParallelPool as Pool3
import time

def square(x):
    # calculate the square of the value of x
    return x*x

if __name__ == '__main__':

    dataset = range(0, 10000)

    start_time = time.time()
    for d in dataset:
        square(d)
    print('test with no cores: %s seconds' % (time.time() - start_time))

    nCores = 3
    print('number of cores used: %s' % (nCores))

    start_time = time.time()
    p = mp.Pool(nCores)
    p.map(square, dataset)
    # Close
    p.close()
    p.join()
    print('test with multiprocessing: %s seconds' % (time.time() - start_time))

    start_time = time.time()
    p = Pool1(nCores)
    p.map(square, dataset)
    # Close
    p.close()
    p.join()
    print('test with pathos multiprocessing: %s seconds' % (time.time() - start_time))

    start_time = time.time()
    p = Pool2(nCores)
    p.map(square, dataset)
    # Close
    p.close()
    p.join()
    print('test with pathos pools: %s seconds' % (time.time() - start_time))

    start_time = time.time()
    p = Pool3()
    p.ncpus = nCores
    p.map(square, dataset)
    # Close
    p.close()
    p.join()
    print('test with pathos parallel: %s seconds' % (time.time() - start_time))
I get about:

- 0.001s with plain serial code (no parallelism),
- 0.100s with the multiprocessing option,
- 0.100s with pathos.multiprocessing,
- 4.470s with pathos.pools,
- an AssertionError with pathos.parallel.
I copied how to use these various options from http://trac.mystic.cacr.caltech.edu/project/pathos/browser/pathos/examples.html
I understand that parallel processing takes longer than plain serial code for such a simple example. What I do not understand is the relative performance of pathos. I checked the discussions, but could not work out why pathos.pools is so much slower, or why I get an error (and so what the performance of that last option would be). I also tried it with a simple square function, and even for that pathos.multiprocessing is much slower than multiprocessing. Could someone explain the differences between these various options?

Additionally, I ran the pathos.multiprocessing option on a remote computer running CentOS, and performance is about 10 times worse than with multiprocessing. According to the company renting out the computer, it should work just like a home computer. I understand that it may be difficult to say much without more details on the machine, but if you have any ideas as to where the difference could come from, that would help.
I'm the pathos author. Sorry for the confusion. You are dealing with a mix of the old and the new programming interface.

The "new" (suggested) interface is to use pathos.pools. The old interface links to the same objects, so it's really two ways to get to the same thing.
multiprocess.Pool is a fork of multiprocessing.Pool, with the only difference being that multiprocessing uses pickle and multiprocess uses dill. So, I'd expect the speed to be the same in most simple cases.
The above pool can also be found at pathos.pools._ProcessPool. pathos provides a small wrapper around several types of pools, with different backends, giving extended functionality. The pathos-wrapped pool is pathos.pools.ProcessPool (and the old interface provides it at pathos.multiprocessing.Pool).

The preferred interface is pathos.pools.ProcessPool.
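For example, a minimal sketch of your benchmark step with the suggested interface (assuming pathos is installed; nodes sets the number of worker processes):

>>> from pathos.pools import ProcessPool
>>> def square(x):
...     return x*x
...
>>> p = ProcessPool(nodes=3)
>>> p.map(square, range(10))
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> p.close()
>>> p.join()
>>> p.clear()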
There's also the ParallelPool, which uses a different backend: ppft instead of multiprocess. ppft is "parallel python", which spawns python processes through subprocess and passes source code (with dill.source instead of serialized objects) -- it's intended for distributed computing, or for when passing source code is a better option.

So, pathos.pools.ParallelPool is the preferred interface to that pool, and pathos.parallel.ParallelPool (and a few other similar references in pathos) are hanging around for legacy reasons -- but they are the same object underneath.
In summary:
>>> import multiprocessing as mp
>>> mp.Pool()
<multiprocessing.pool.Pool object at 0x10fa6b6d0>
>>> import multiprocess as mp
>>> mp.Pool()
<multiprocess.pool.Pool object at 0x11000c910>
>>> import pathos as pa
>>> pa.pools._ProcessPool()
<multiprocess.pool.Pool object at 0x11008b0d0>
>>> pa.multiprocessing.Pool()
<multiprocess.pool.Pool object at 0x11008bb10>
>>> pa.pools.ProcessPool()
<pool ProcessPool(ncpus=4)>
>>> pa.pools.ParallelPool()
<pool ParallelPool(ncpus=*, servers=None)>
You can see the ParallelPool has servers... thus it is intended for distributed computing.
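A rough sketch of what that might look like (the server address below is purely hypothetical, and a ppft server would have to be listening there for it to do anything useful):

>>> from pathos.pools import ParallelPool
>>> p = ParallelPool(ncpus=3, servers=('192.168.0.1:35000',))  # hypothetical remote server
>>> p.map(lambda x:x*x, range(10))
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> p.close()
>>> p.join()
>>> p.clear()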
The only remaining question is why the AssertionError? Well, that is because the wrapper that pathos adds keeps a pool object available for reuse. Hence, when you call the ParallelPool a second time, you are calling a closed pool. You'd need to restart the pool to enable it to be used again.
>>> f = lambda x:x
>>> p = pa.pools.ParallelPool()
>>> p.map(f, [1,2,3])
[1, 2, 3]
>>> p.close()
>>> p.join()
>>> p.restart() # throws AssertionError w/o this
>>> p.map(f, [1,2,3])
[1, 2, 3]
>>> p.close()
>>> p.join()
>>> p.clear() # destroy the saved pool
The ProcessPool has the same interface as ParallelPool with respect to restarting and clearing saved instances.
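For example, continuing the session above with a ProcessPool (just a sketch of the same pattern):

>>> p = pa.pools.ProcessPool()
>>> p.map(f, [1,2,3])
[1, 2, 3]
>>> p.close()
>>> p.join()
>>> p.restart()  # again needed before reusing the closed pool
>>> p.map(f, [1,2,3])
[1, 2, 3]
>>> p.close()
>>> p.join()
>>> p.clear()  # destroy the saved pool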