
pathos: parallel processing options - Could someone explain the differences?

I am trying to run parallel processes in Python (on Ubuntu).

I started with multiprocessing, and it worked fine for simple examples.
Then I hit a pickle error, so I switched to pathos. The different options confused me a little, so I wrote a very simple benchmark.

import multiprocessing as mp
from pathos.multiprocessing import Pool as Pool1
from pathos.pools import ParallelPool as Pool2
from pathos.parallel import ParallelPool as Pool3
import time

def square(x):  
    # calculate the square of the value of x
    return x*x

if __name__ == '__main__':

    dataset = range(0,10000)

    start_time = time.time()
    for d in dataset:
        square(d)
    print('test with no cores: %s seconds' %(time.time() - start_time))

    nCores = 3
    print('number of cores used: %s' %(nCores))  


    start_time = time.time()

    p = mp.Pool(nCores)
    p.map(square, dataset)

    # Close
    p.close()
    p.join()

    print('test with multiprocessing: %s seconds' %(time.time() - start_time))


    start_time = time.time()

    p = Pool1(nCores)
    p.map(square, dataset)

    # Close
    p.close()
    p.join()

    print('test with pathos multiprocessing: %s seconds' %(time.time() - start_time))


    start_time = time.time()

    p = Pool2(nCores)
    p.map(square, dataset)

    # Close
    p.close()
    p.join()

    print('test with pathos pools: %s seconds' %(time.time() - start_time))


    start_time = time.time()

    p = Pool3()
    p.ncpus = nCores
    p.map(square, dataset)

    # Close
    p.close()
    p.join()

    print('test with pathos parallel: %s seconds' %(time.time() - start_time))

I get about:
- 0.001s with plain serial code (no parallelism),
- 0.100s with multiprocessing,
- 0.100s with pathos.multiprocessing,
- 4.470s with pathos.pools,
- an AssertionError with pathos.parallel.

I copied how to use these various options from http://trac.mystic.cacr.caltech.edu/project/pathos/browser/pathos/examples.html

I understand that parallel processing is slower than plain serial code for such a simple example. What I do not understand is the relative performance of the pathos options.

I checked other discussions, but could not understand why pathos.pools is so much slower, or why I get an error (so I do not know how that last option would perform).

I also tried with a simple square function, and for that even pathos.multiprocessing is much slower than multiprocessing.

Could someone explain the differences between these various options?

Additionally, I ran the pathos.multiprocessing option on a remote computer running CentOS, and performance is about 10 times worse than with multiprocessing.

According to the company renting out the computer, it should work just like a home computer. I understand it may be difficult to help without more details on the machine, but any ideas about where the slowdown could come from would help.

asked Dec 11 '22 07:12 by Olivier


1 Answer

I'm the pathos author. Sorry for the confusion. You are dealing with a mix of the old and new programming interface.

The "new" (suggested) interface is to use pathos.pools. The old interface links to the same objects, so it's really two ways to get to the same thing.

multiprocess.Pool is a fork of multiprocessing.Pool, with the only difference being that multiprocessing uses pickle and multiprocess uses dill. So, I'd expect the speed to be the same in most simple cases.
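To make the pickle-versus-dill point concrete, here is a minimal sketch using only the standard library. It checks what pickle can serialize, which is what multiprocessing relies on when it ships a function to a worker. The helper name can_pickle is made up for this illustration. dill is not imported here; it is designed to serialize objects such as lambdas, which is why the multiprocess fork accepts them.

```python
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A built-in function is pickled by reference to its importable name.
builtin_ok = can_pickle(abs)

# A lambda has no importable name, so the standard pickle module rejects it.
lambda_ok = can_pickle(lambda x: x * x)

print(builtin_ok, lambda_ok)
```

A function that fails this check is the kind that triggers the pickle error with multiprocessing.Pool.map.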

The above pool can also be found at pathos.pools._ProcessPool. pathos provides a small wrapper around several types of pools with different backends, providing extended functionality. The pathos-wrapped pool is pathos.pools.ProcessPool (and the old interface provides it at pathos.multiprocessing.Pool).

The preferred interface is pathos.pools.ProcessPool.

There's also the ParallelPool, which uses a different backend -- it uses ppft instead of multiprocess. ppft is "parallel python", which spawns python processes through subprocess and passes source code (extracted with dill.source) instead of serialized objects -- it's intended for distributed computing, or when passing by source code is a better option.

So, pathos.pools.ParallelPool is the preferred interface, and pathos.parallel.ParallelPool (and a few other similar references in pathos) are hanging around for legacy reasons -- but they are the same object underneath.
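As a rough illustration of "passing by source code", the sketch below sends a function's source text to a fresh Python interpreter through subprocess and reads the result back. This is not how ppft is implemented; it is a toy using only the standard library, and the names SQUARE_SOURCE and run_from_source are invented for the example.

```python
import ast
import subprocess
import sys

# The function travels as text, not as a serialized object.
SQUARE_SOURCE = "def square(x):\n    return x * x\n"


def run_from_source(source, func_name, arg):
    """Run func_name(arg) in a fresh Python process built from source text."""
    program = "{}\nprint(repr({}({!r})))\n".format(source, func_name, arg)
    done = subprocess.run(
        [sys.executable, "-c", program],
        capture_output=True,
        text=True,
        check=True,
    )
    # The child prints repr(result); parse it back into a Python value.
    return ast.literal_eval(done.stdout.strip())


result = run_from_source(SQUARE_SOURCE, "square", 7)
print(result)
```

Starting an interpreter and shipping source has a fixed cost per worker, so an approach like this tends to pay off only when each task does substantial work or when the workers live on other machines.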

In summary:

>>> import multiprocessing as mp
>>> mp.Pool()
<multiprocessing.pool.Pool object at 0x10fa6b6d0>
>>> import multiprocess as mp
>>> mp.Pool()
<multiprocess.pool.Pool object at 0x11000c910>
>>> import pathos as pa
>>> pa.pools._ProcessPool()
<multiprocess.pool.Pool object at 0x11008b0d0>
>>> pa.multiprocessing.Pool()
<multiprocess.pool.Pool object at 0x11008bb10>
>>> pa.pools.ProcessPool()
<pool ProcessPool(ncpus=4)>
>>> pa.pools.ParallelPool()
<pool ParallelPool(ncpus=*, servers=None)>

You can see that ParallelPool has servers, and is thus intended for distributed computing.

The only remaining question is: why the AssertionError? That is because the wrapper that pathos adds keeps a pool object available for reuse. Hence, when you call the ParallelPool a second time, you are calling a closed pool. You'd need to restart the pool to enable it to be used again.

>>> f = lambda x:x
>>> p = pa.pools.ParallelPool()
>>> p.map(f, [1,2,3])
[1, 2, 3]
>>> p.close()
>>> p.join()
>>> p.restart()  # throws AssertionError w/o this
>>> p.map(f, [1,2,3])
[1, 2, 3]
>>> p.close()
>>> p.join()
>>> p.clear()  # destroy the saved pool

The ProcessPool has the same interface as ParallelPool, with respect to restarting and clearing saved instances.

answered Jan 26 '23 01:01 by Mike McKerns