
numba - guvectorize barely faster than jit

I was trying to parallelize a Monte Carlo simulation that operates on many independent datasets. I found that numba's parallel guvectorize implementation was barely 30-40% faster than the numba jit implementation.

I found these (1, 2) comparable topics on Stack Overflow, but they do not really answer my question. In the first case, the implementation is slowed down by a fallback to object mode, and in the second case the original poster did not use guvectorize properly - neither problem applies to my code.

To make sure there was no problem with my code, I created this very simple piece of code to compare jit to guvectorize:

import timeit
import numpy as np
from numba import jit, guvectorize

#both functions take an (m x n) array as input, compute the row sum, and return the row sums in a (m x 1) array
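#in the guvectorize layout "(n) -> ()", each length-n row maps to one scalar;
#the leading (m) dimension is broadcast, and target="parallel" splits that
#broadcast dimension across threads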

@guvectorize(["void(float64[:], float64[:])"], "(n) -> ()", target="parallel", nopython=True)
def row_sum_gu(input, output):
    output[0] = np.sum(input)

@jit(nopython=True)
def row_sum_jit(input_array, output_array):
    m, n = input_array.shape
    for i in range(m):
        output_array[i] = np.sum(input_array[i,:])

rows = int(64) #broadcasting (= supposed parallelization) dimension for guvectorize
columns = int(1e6)
input_array = np.ones((rows, columns))
output_array = np.zeros((rows))
output_array2 = np.zeros((rows))

#the first run includes the compile time
row_sum_jit(input_array, output_array)
row_sum_gu(input_array, output_array2)

#run each function 100 times and record the time
print("jit time:", timeit.timeit("row_sum_jit(input_array, output_array)", "from __main__ import row_sum_jit, input_array, output_array", number=100))
print("guvectorize time:", timeit.timeit("row_sum_gu(input_array, output_array2)", "from __main__ import row_sum_gu, input_array, output_array2", number=100))

This gives me the following output (the times do vary a bit):

jit time: 12.04114792868495
guvectorize time: 5.415564753115177

So again, the parallel code is barely twice as fast (and only when the number of rows is an integer multiple of the number of CPU cores; otherwise the performance advantage diminishes), even though it uses all CPU cores while the jit code uses only one (verified using htop).

I am running this on a machine with four AMD Opteron 6380 CPUs (64 cores in total) and 256 GB of RAM, on Red Hat 4.4.7-1. I use Anaconda 4.2.0 with Python 3.5.2 and Numba 0.26.0.

How can I further improve the parallel performance or what am I doing wrong?

Thank you for your answers.

asked Jan 23 '17 by Dries Van Laethem



1 Answer

That's because np.sum is too simple here. Summing an array is limited not only by the CPU but also by memory access time, so the operation is largely memory bound. Throwing more cores at it therefore doesn't make much of a difference (how much depends on how fast memory access is relative to your CPU).

Just for visualization, np.sum works something like this (ignoring every parameter other than the data):

def sum(data):
    sum_ = 0.
    data = data.ravel()
    for i in range(data.size):
        item = data[i]   # memory access (I/O bound)
        sum_ += item     # addition      (CPU bound)
    return sum_

So if most of the time is spent accessing memory, you won't see any real speedup from parallelizing. However, if the CPU-bound work is the bottleneck, then using more cores will speed up your code significantly.
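To put numbers on that, here is a quick back-of-the-envelope estimate (a rough sketch, assuming the input array is read exactly once per call) plugging in the timings from the question:

# Sketch: effective memory bandwidth of the question's guvectorize benchmark
rows, columns = 64, int(1e6)
bytes_per_pass = rows * columns * 8        # float64 input: ~512 MB per call
runs = 100
elapsed = 5.42                             # guvectorize time for 100 runs (seconds)
bandwidth_gbs = bytes_per_pass * runs / elapsed / 1e9
print("effective bandwidth: %.1f GB/s" % bandwidth_gbs)   # ~9.4 GB/s

Roughly 9 GB/s is in the range a DDR3 memory system can sustain, so adding cores past that point mostly just queues up memory requests.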

For example, if you include some operations slower than addition, you'll see a bigger improvement:

from math import sqrt
from numba import njit, jit, guvectorize
import timeit
import numpy as np

@njit
def square_sum(arr):
    a = 0.
    for i in range(arr.size):
        a = sqrt(a**2 + arr[i]**2)  # sqrt and square are cpu-intensive!
    return a

@guvectorize(["void(float64[:], float64[:])"], "(n) -> ()", target="parallel", nopython=True)
def row_sum_gu(input, output):
    output[0] = square_sum(input)

@jit(nopython=True)
def row_sum_jit(input_array, output_array):
    m, n = input_array.shape
    for i in range(m):
        output_array[i] = square_sum(input_array[i,:])
    return output_array

I used IPython's %timeit here, but the result should be equivalent:

rows = int(64)
columns = int(1e6)

input_array = np.random.random((rows, columns))
output_array = np.zeros((rows))
output_array2 = np.zeros((rows))  # separate output buffer for the guvectorize version

# Warm up and check that the results are equal
np.testing.assert_equal(row_sum_jit(input_array, output_array), row_sum_gu(input_array, output_array2))
%timeit row_sum_jit(input_array, output_array.copy())  # 10 loops, best of 3: 130 ms per loop
%timeit row_sum_gu(input_array, output_array.copy())   # 10 loops, best of 3: 35.7 ms per loop

I'm only using 4 cores, and 130 ms / 35.7 ms is a speedup of roughly 3.6x, so that's pretty close to the limit of the possible speedup!

Just remember that parallel computation can only significantly speed up your calculation if the job is limited by the CPU.
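As a side note, newer Numba releases also support @njit(parallel=True) with prange, which parallelizes the row loop directly and avoids the gufunc signature boilerplate. A minimal sketch, assuming a Numba version newer than the 0.26 from the question:

import numpy as np
from numba import njit, prange

# Sketch: parallelize over rows with prange instead of guvectorize
@njit(parallel=True)
def row_sum_prange(input_array, output_array):
    m, n = input_array.shape
    for i in prange(m):            # row iterations are distributed across threads
        s = 0.0
        for j in range(n):
            s += input_array[i, j]
        output_array[i] = s
    return output_array

row_sum_prange(np.ones((4, 10)), np.zeros(4))   # compiles and runs the parallel loop

The same memory-bandwidth caveat applies: for a plain sum this will hit the same ceiling as the guvectorize version.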

answered Nov 03 '22 by MSeifert