 

Numba slow when assigning to an array?

Numba seems to be a great solution for accelerating the execution of numeric code. However, when there are assignments to an array Numba seems to be slower than standard Python code. Consider this example comparing four alternatives, with/without Numba, writing to an array/scalar:

(The calculations were kept very simple on purpose, to focus on the issue, which is assignment to a scalar vs. assignment to an array cell.)

@autojit
def fast_sum_arr(arr):
    z = arr.copy()
    M = len(arr)
    for i in range(M):
        z[i] += arr[i]

    return z

def sum_arr(arr):
    z = arr.copy()
    M = len(arr)
    for i in range(M):
        z[i] += arr[i]

    return z

@autojit
def fast_sum_sclr(arr):
    z = 0
    M = len(arr)
    for i in range(M):
        z += arr[i]

    return z

def sum_sclr(arr):
    z = 0
    M = len(arr)
    for i in range(M):
        z += arr[i]

    return z
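For reference, here is one plausible way to reproduce the setup (the question doesn't show how arr was built, so the dtype and size below are assumptions):

```python
import numpy as np

def sum_arr(arr):
    # Pure-Python version from above: doubles each element via
    # an explicit loop that writes into an array cell.
    z = arr.copy()
    M = len(arr)
    for i in range(M):
        z[i] += arr[i]
    return z

def sum_sclr(arr):
    # Pure-Python version from above: accumulates into a scalar.
    z = 0
    M = len(arr)
    for i in range(M):
        z += arr[i]
    return z

arr = np.arange(100000, dtype=np.float64)  # assumed test input

# Sanity checks: the array loop just doubles arr, and the scalar
# loop computes the ordinary sum.
print(np.array_equal(sum_arr(arr), 2 * arr))
print(sum_sclr(arr) == arr.sum())
```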

Using IPython's %timeit to evaluate the four alternatives, I got:

In [125]: %timeit fast_sum_arr(arr)
100 loops, best of 3: 10.8 ms per loop

In [126]: %timeit sum_arr(arr)
100 loops, best of 3: 4.11 ms per loop

In [127]: %timeit fast_sum_sclr(arr)
100000 loops, best of 3: 10 us per loop

In [128]: %timeit sum_sclr(arr)
100 loops, best of 3: 2.93 ms per loop

sum_arr, which was not compiled with Numba, is more than twice as fast as fast_sum_arr, which was compiled with Numba (4.11 ms vs. 10.8 ms). On the other hand, fast_sum_sclr, which was compiled with Numba, is more than two orders of magnitude faster than sum_sclr, which was not (10 µs vs. 2.93 ms).

So Numba performs remarkably well the task of accelerating sum_sclr but actually makes sum_arr execute slower. The only difference between sum_sclr and sum_arr is that the former assigns to a scalar while the latter assigns to an array cell.

I don't know if there is any relation, but I recently read the following on the blog http://www.phi-node.com/:

"It turns out that when Numba is confronted with any construct it doesn't support directly, it switches to a (very) slow code path."

The blog author got Numba to perform much faster using an if statement instead of Python's max().

Any insights on this?

Thanks,

FS

Soldalma asked Aug 06 '13

2 Answers

What is slow here is the arr.copy() function, not the write access to an array. Proof:

# -*- coding: utf-8 -*-
from numba import autojit
from Timer import Timer  # custom context manager recording elapsed seconds as .secs (not shown)
import numpy as np

@autojit
def fast_sum_arr(arr, z):
    #z = arr.copy()
    M = len(arr)
    for i in range(M):
        z[i] += arr[i]

    return z

def sum_arr(arr, z):
    #z = arr.copy()
    M = len(arr)
    for i in range(M):
        z[i] += arr[i]

    return z

@autojit
def fast_sum_sclr(arr):
    z = 0
    M = len(arr)
    for i in range(M):
        z += arr[i]

    return z

def sum_sclr(arr):
    z = 0
    M = len(arr)
    for i in range(M):
        z += arr[i]

    return z

if __name__ == '__main__':
    vec1 = np.ones(1000)
    z = vec1.copy()
    with Timer() as t0:
        for i in range(10000):
            pass
    print "time for empty loop ", t0.secs
    print
    with Timer() as t1:
        for i in range(10000):
            sum_arr(vec1, z)
    print "time for sum_arr  [µs]:   ", (t1.secs-t0.secs)  / 10000 * 1e6
    with Timer() as t1:
        for i in range(10000):
            fast_sum_arr(vec1, z)
    print "time for fast_sum_arr  [µs]:   ", (t1.secs-t0.secs)  / 10000 * 1e6
    with Timer() as t1:
        for i in range(10000):
            sum_sclr(vec1)
    print "time for sum_sclr  [µs]:   ", (t1.secs-t0.secs)  / 10000 * 1e6
    with Timer() as t1:
        for i in range(10000):
            fast_sum_sclr(vec1)
    print "time for fast_sum_sclr  [µs]:   ", (t1.secs-t0.secs)  / 10000 * 1e6

"""
time for empty loop  0.000312089920044

time for sum_arr        [µs]:    432.02688694
time for fast_sum_arr   [µs]:      7.43598937988
time for sum_sclr       [µs]:    284.574580193
time for fast_sum_sclr  [µs]:      5.74610233307
"""
Uwe Fechner answered Oct 27 '22


I don't know much about numba, but if we make some basic assumptions about what it's doing under the hood, we can infer why the autojit version is slower and how to speed it up with minor changes...

Let's start with the sum_arr,

1 def sum_arr(arr):
2     z = arr.copy()
3     M = len(arr)
4     for i in range(M):
5         z[i] += arr[i]
6 
7     return z

Pretty clear what's going on here, but let's pick apart line 5, which can be rewritten as

1 a = arr[i]
2 b = z[i]
3 c = a + b
4 z[i] = c

Python will further interpret this as

1 a = arr.__getitem__(i)
2 b = z.__getitem__(i)
3 c = a.__add__(b)
4 z.__setitem__(i, c)

a, b, and c are all instances of numpy.int64 (or similar).
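That type difference is easy to see without Numba: indexing a NumPy array yields a NumPy scalar, while tolist() yields plain Python ints (a quick check, not from the answer):

```python
import numpy as np

arr = np.arange(3, dtype=np.int64)

a = arr[0]           # indexing returns a NumPy scalar, not a Python int
print(type(a))       # <class 'numpy.int64'>

lst = arr.tolist()   # tolist() converts elements to plain Python objects
print(type(lst[0]))  # <class 'int'>
```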

I suspect that numba is trying to inspect the data type of these items and convert them into some numba-native datatypes (one of the biggest slowdowns I see with numpy code is inadvertently switching from python datatypes to numpy datatypes). If this is indeed what's going on, numba is doing at least 3 conversions: 2 numpy.int64 -> native, 1 native -> numpy.int64, or probably worse with intermediates (numpy.int64 -> python int -> native (c int)). I suspect numba also adds extra overhead checking datatypes, and probably doesn't optimize the loop at all. Let's see what happens if we remove the type change from the loop...

1 @autojit
2 def fast_sum_arr2(arr):
3     z = arr.tolist()
4     M = len(arr)
5     for i in range(M):
6         z[i] += arr[i]
7 
8     return numpy.array(z)

The subtle change on line 3, tolist instead of copy, changes the datatype to Python ints, but we still have a numpy.int64 -> native conversion on line 6. Let's rewrite that as z[i] += z[i]:

1 @autojit
2 def fast_sum_arr3(arr):
3     z = arr.tolist()
4     M = len(arr)
5     for i in range(M):
6         z[i] += z[i]
7 
8     return numpy.array(z)

With all the changes we see a pretty substantial speedup (although it doesn't necessarily beat pure python). Of course, arr + arr is just stupid fast.

import numpy
from numba import autojit

def sum_arr(arr):
    z = arr.copy()
    M = len(arr)
    for i in range(M):
        z[i] += arr[i]

    return z

@autojit
def fast_sum_arr(arr):
    z = arr.copy()
    M = len(arr)
    for i in range(M):
        z[i] += arr[i]

    return z

def sum_arr2(arr):
    z = arr.tolist()
    M = len(arr)
    for i in range(M):
        z[i] += arr[i]

    return numpy.array(z)

@autojit
def fast_sum_arr2(arr):
    z = arr.tolist()
    M = len(arr)
    for i in range(M):
        z[i] += arr[i]

    return numpy.array(z)

def sum_arr3(arr):
    z = arr.tolist()
    M = len(arr)
    for i in range(M):
        z[i] += z[i]

    return numpy.array(z)

@autojit
def fast_sum_arr3(arr):
    z = arr.tolist()
    M = len(arr)
    for i in range(M):
        z[i] += z[i]

    return numpy.array(z)

def sum_arr4(arr):
    return arr + arr

@autojit
def fast_sum_arr4(arr):
    return arr + arr

arr = numpy.arange(1000)

And the timings,

In [1]: %timeit sum_arr(arr)
10000 loops, best of 3: 129 us per loop

In [2]: %timeit sum_arr2(arr)
1000 loops, best of 3: 232 us per loop

In [3]: %timeit sum_arr3(arr)
10000 loops, best of 3: 51.8 us per loop

In [4]: %timeit sum_arr4(arr)
100000 loops, best of 3: 3.68 us per loop

In [5]: %timeit fast_sum_arr(arr)
1000 loops, best of 3: 216 us per loop

In [6]: %timeit fast_sum_arr2(arr)
10000 loops, best of 3: 65.6 us per loop

In [7]: %timeit fast_sum_arr3(arr)
10000 loops, best of 3: 56.5 us per loop

In [8]: %timeit fast_sum_arr4(arr)
100000 loops, best of 3: 2.03 us per loop
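For what it's worth, in later Numba releases the array-assignment loop compiles cleanly in nopython mode and no longer falls back to a slow object-mode path. A sketch, assuming a modern Numba with njit (the try/except fallback lets the example run even where Numba isn't installed):

```python
import numpy as np

try:
    from numba import njit  # modern replacement for the old autojit
except ImportError:
    def njit(func):          # no-op fallback so the sketch still runs
        return func

@njit
def fast_sum_arr(arr):
    # Same array-assignment loop as in the question; under nopython
    # mode both the copy and the element writes compile to machine code.
    z = arr.copy()
    for i in range(len(arr)):
        z[i] += arr[i]
    return z

arr = np.arange(1000, dtype=np.float64)
print(np.array_equal(fast_sum_arr(arr), 2 * arr))  # True
```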
Charles answered Oct 27 '22