
Inefficient multiprocessing of numpy-based calculations

I'm trying to parallelize some calculations that use numpy with the help of Python's multiprocessing module. Consider this simplified example:

import time
import numpy

from multiprocessing import Pool

def test_func(i):
    a = numpy.random.normal(size=1000000)
    b = numpy.random.normal(size=1000000)

    # arithmetic swap of a and b, repeated 2000 times;
    # each statement allocates a new result array
    for _ in range(2000):
        a = a + b
        b = a - b
        a = a - b

    return 1

t1 = time.time()
test_func(0)
single_time = time.time() - t1
print("Single time:", single_time)

n_par = 4
pool = Pool()

t1 = time.time()
results_async = [
    pool.apply_async(test_func, [i])
    for i in range(n_par)]
results = [r.get() for r in results_async]
multicore_time = time.time() - t1

print("Multicore time:", multicore_time)
print("Efficiency:", single_time / multicore_time)

When I execute it, the multicore_time is roughly equal to single_time * n_par, while I would expect it to be close to single_time. Indeed, if I replace numpy calculations with just time.sleep(10), this is what I get — perfect efficiency. But for some reason it does not work with numpy. Can this be solved, or is it some internal limitation of numpy?

Some additional info which may be useful:

  • I'm using OSX 10.9.5, Python 3.4.2 and the CPU is Core i7 with (as reported by the system info) 4 cores (although the above program only takes 50% of CPU time in total, so the system info may not be taking into account hyperthreading).

  • when I run this I see n_par processes in top working at 100% CPU

  • if I replace the numpy array operations with a loop and per-index operations, the efficiency rises significantly (to about 75% for n_par = 4); a rough sketch of what I mean follows this list.
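
A sketch of that per-index variant (sizes scaled down here, since a pure Python loop is orders of magnitude slower; test_func_indexed is just a placeholder name):

import numpy

def test_func_indexed(i):
    a = numpy.random.normal(size=10000)
    b = numpy.random.normal(size=10000)
    for _ in range(200):
        for j in range(a.size):
            # same swap-by-arithmetic as above, one element at a time
            t = a[j] + b[j]
            b[j] = t - b[j]   # old a[j]
            a[j] = t - b[j]   # old b[j]
    return 1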

asked Mar 31 '15 by fjarri



3 Answers

It looks like the test function you're using is memory bound. That means that the run time you're seeing is limited by how fast the computer can pull the arrays from memory into cache. For example, the line a = a + b actually involves 3 arrays: a, b, and a new array that will replace a. Each of these arrays is about 8MB (1e6 floats * 8 bytes per float). I believe the different i7s have something like 3MB - 8MB of shared L3 cache, so you cannot fit all 3 arrays in cache at once. Your CPU adds the floats faster than the arrays can be loaded into cache, so most of the time is spent waiting for data to be read from memory. Because that cache is shared between the cores, you don't see any speedup by spreading the work onto multiple cores.

Memory-bound operations are an issue for numpy in general, and the only way I know to deal with them is to use something like cython or numba.
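
For illustration, here is a minimal numba sketch of that idea (assuming numba is installed; swap_loop and test_func_numba are my own names, not from the original code). It fuses the three whole-array statements into one compiled per-element loop, which avoids the temporary arrays and cuts the number of passes over memory per iteration:

import numpy
from numba import njit

@njit(cache=True)
def swap_loop(a, b, iters):
    # fused, in-place equivalent of: a = a + b; b = a - b; a = a - b
    # (i.e. swap a and b via arithmetic) with no temporary arrays
    for _ in range(iters):
        for j in range(a.size):
            t = a[j] + b[j]
            b[j] = t - b[j]   # old a[j]
            a[j] = t - b[j]   # old b[j]

def test_func_numba(i):
    a = numpy.random.normal(size=1000000)
    b = numpy.random.normal(size=1000000)
    swap_loop(a, b, 2000)
    return 1

Whether this restores good multi-core scaling still depends on how much memory traffic remains, but it removes the allocation of a fresh result array on every statement.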

answered Oct 04 '22 by Bi Rico


One easy thing that should bump efficiency up is to do in-place array operations where possible: numpy.add(a, b, a) does not create a new array, while a = a + b does. If your for loop over numpy arrays can be rewritten as vector operations, that should be more efficient as well. Another possibility is to use numpy.ctypeslib to enable shared-memory numpy arrays (see: https://stackoverflow.com/a/5550156/2379433).
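
For instance, a rough sketch of the in-place version of the loop from the question (my adaptation, using the out= argument of the ufuncs; not benchmarked):

import numpy

def test_func_inplace(i):
    a = numpy.random.normal(size=1000000)
    b = numpy.random.normal(size=1000000)
    for _ in range(2000):
        # same arithmetic as the original, but each result is written into
        # an existing array instead of allocating a new one per statement
        numpy.add(a, b, out=a)       # a = a + b
        numpy.subtract(a, b, out=b)  # b = a - b
        numpy.subtract(a, b, out=a)  # a = a - b
    return 1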

answered Oct 04 '22 by Mike McKerns


I program numerical methods for mathematics and ran into the same problem: I wasn't seeing any speed-up for a supposedly CPU-bound problem. It turned out I was hitting the CPU cache limit.

I used Intel PCM (Intel® Performance Counter Monitor) to watch how the CPU cache behaved (displaying it inside Linux ksysguard). I also disabled 2 of my processors to get clearer results (leaving 2 active).

Here is what I have found out with this code:

import time
import multiprocessing as mp

import numpy as np

def somethinglong(b):
    # repeatedly multiplies an array by a scalar; the argument b is unused
    n = 200000
    m = 5000
    shared = np.arange(n)
    for i in np.arange(m):
        0.01 * shared   # result is discarded, but a temporary array is still created

pool = mp.Pool(2)
jobs = [() for i in range(8)]
for i in range(5):
    timei = time.time()
    pool.map(somethinglong, jobs, chunksize=1)
    # sequential version, for comparison:
    # for job in jobs:
    #     somethinglong(job)
    print(time.time() - timei)

Example that doesn't reach the cache memory limit:

  • n=10000
  • m=100000
  • Sequential execution: 15s
  • 2-process pool, no cache memory limit: 8s

There are essentially no cache misses (all cache hits), so the speed-up is almost perfect: 15/8. [PCM plot: memory cache hits, 2-process pool]

Example that reaches the cache memory limit:

  • n=200000
  • m=5000
  • Sequential execution: 14s
  • 2-process pool, cache memory limit reached: 14s

In this case, I increased the size of the vector we operate on (and decreased the loop count, to keep execution times reasonable). Now the cache fills up and the processes constantly miss it, so there is no speed-up at all: 14s sequentially vs 14s with the pool. [PCM plot: memory cache misses, 2-process pool]
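
A quick back-of-the-envelope check of the working set (my own arithmetic, assuming 8 bytes per element: np.arange gives int64 and 0.01*shared creates a float64 temporary of the same length):

def working_set_mb(n, processes=2):
    # shared plus the temporary produced by 0.01*shared, per process
    per_array = n * 8 / 1e6
    return processes * 2 * per_array

print(working_set_mb(10000))    # ~0.32 MB: fits comfortably in a shared L3 cache
print(working_set_mb(200000))   # ~6.4 MB: around or above a typical L3 size, hence the misses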

Observation: assigning the result to a variable (aux = 0.01*shared) also consumes cache and can make the problem memory-bound, without increasing the vector size at all.

answered Oct 04 '22 by Marc Cayuela