Why is NumPy sometimes slower than NumPy + plain Python loop?

This is based on a similar question asked in October 2018.

Consider the following code: three simple functions that count the non-zero elements of a NumPy 3D array (1000 × 1000 × 1000).

import numpy as np

def f_1(arr):
    return np.sum(arr > 0)

def f_2(arr):
    ans = 0
    for val in range(arr.shape[0]):
        ans += np.sum(arr[val, :, :] > 0)
    return ans

def f_3(arr):
    return np.count_nonzero(arr)

if __name__ == '__main__':

    data = np.random.randint(0, 10, (1_000, 1_000, 1_000))
    print(f_1(data))
    print(f_2(data))
    print(f_3(data))

Runtimes on my machine (Python 3.7.?, Windows 10, NumPy 1.16.?):

%timeit f_1(data)
1.73 s ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_2(data)
1.4 s ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_3(data)
2.38 s ± 956 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

So f_2() runs faster than f_1() and f_3(). However, that is not the case for smaller arrays. The question is: why? Is it NumPy, Python, or something else?

asked by Poolka, Jun 28 '19


2 Answers

This is due to memory access and caching. Each of these functions does two things; taking the first function as an example:

np.sum(arr > 0)

It first does a comparison to find where arr is greater than zero (equivalently, non-zero, since arr contains non-negative integers). This creates an intermediate boolean array of the same shape as arr. Then it sums this array.

Straightforward, right? Well, for np.sum(arr > 0) this intermediate is a large array. When it is too large to fit in the CPU cache, performance drops: by the time the processor starts executing the sum, most of the array's elements have been evicted from cache and must be reloaded from main memory.
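To get a sense of the sizes involved, you can check the intermediate array directly. A quick sketch, assuming the 1000 × 1000 × 1000 array from the question:

import numpy as np

data = np.random.randint(0, 10, (1_000, 1_000, 1_000))  # as in the question

# The comparison materializes a boolean array of the same shape,
# one byte per element: about 10**9 bytes, or ~0.93 GiB.
temp = data > 0
print(temp.dtype, temp.nbytes)   # bool 1000000000
# Typical CPU caches hold a few MB to a few tens of MB, so this
# temporary cannot stay cached between the comparison and the sum.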

Since f_2 iterates over the first dimension, it deals with smaller sub-arrays. The same comparison and sum are performed, but this time the intermediate array is small enough to fit in cache. It is created, used, and destroyed without ever leaving cache. This is much faster.
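The same idea generalizes to processing any cache-sized chunk at a time. A minimal sketch (the function name chunked_count_positive and the default chunk size are illustrative, not from the question):

import numpy as np

def chunked_count_positive(arr, chunk=8):
    # Process `chunk` slabs of the first axis at a time, so each
    # boolean temporary is small enough to live and die in cache.
    ans = 0
    for start in range(0, arr.shape[0], chunk):
        ans += np.sum(arr[start:start + chunk] > 0)
    return ans

The best chunk size depends on your cache sizes, so it is worth timing a few values.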

Now, you would think that f_3 would be the fastest (it uses a built-in method, after all), but looking at the source code shows that it performs the following operations:

a_bool = a.astype(np.bool_, copy=False)
return a_bool.sum(axis=axis, dtype=np.intp)

a_bool is just another way of finding the non-zero entries, and creates a large intermediate array.

Conclusions

Rules of thumb are just that, and are frequently wrong. If you want faster code, profile it and see what the problems are (good work on that here).

Python does some things very well. In cases where it's optimized, it can be faster than NumPy. Don't be afraid to use plain old Python code or datatypes in combination with NumPy.
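As an illustration of the overhead involved, consider a tiny input, where NumPy's per-call costs (argument parsing, dtype dispatch) can outweigh its fast inner loop. A rough sketch; the exact numbers vary by machine:

import numpy as np

small_list = list(range(100))
small_arr = np.array(small_list)

# On inputs this small, the plain builtin is usually several times
# faster, because np.sum's fixed per-call overhead dominates:
# %timeit sum(small_list)
# %timeit np.sum(small_arr)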

If you frequently find yourself manually writing for loops for better performance, you may want to take a look at numexpr, which automates some of this. I haven't used it much myself, but it should provide a good speedup if intermediate arrays are what's slowing down your program.
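A minimal sketch of what that could look like here (assuming numexpr is installed; f_ne is a hypothetical name, and the exact expression for a boolean reduction may need adjusting):

import numexpr as ne
import numpy as np

def f_ne(arr):
    # numexpr evaluates the expression in cache-sized blocks, so the
    # intermediate never materializes as one huge array.
    return ne.evaluate('sum(where(arr > 0, 1, 0))')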

answered by user2699


It's all a matter of how the data is laid out in memory and how the code accesses it. Essentially, data is fetched from memory in blocks, which are then cached; if an algorithm manages to use data from a block that is already in the cache, there is no need to read from memory again. This can result in huge time savings, especially when the cache is much smaller than the data you are dealing with.
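You can see this layout directly in the array's strides. A small sketch (using a 100 × 100 × 100 int64 array rather than the full-size one; the pattern is the same):

import numpy as np

a = np.zeros((100, 100, 100), dtype=np.int64)

# Default C (row-major) layout: one step along the last axis advances
# 8 bytes, along the middle axis 800 bytes, along the first 80,000.
print(a.strides)   # (80000, 800, 8)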

Consider these variations, which only differ in which axis we are iterating on:

def f_2_0(arr):
    ans = 0
    for val in range(arr.shape[0]):
        ans += np.sum(arr[val, :, :] > 0)
    return ans

def f_2_1(arr):
    ans = 0
    for val in range(arr.shape[1]):
        ans += np.sum(arr[:, val, :] > 0)
    return ans

def f_2_2(arr):
    ans = 0
    for val in range(arr.shape[2]):
        ans += np.sum(arr[:, :, val] > 0)
    return ans

And the results on my laptop:

%timeit f_1(data)
2.31 s ± 47.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_2_0(data)
1.88 s ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_2_1(data)
2.65 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit f_2_2(data)
12.8 s ± 650 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can see that f_2_1 is almost as fast as f_1, which makes me think that NumPy is not using the optimal access pattern (the one used by f_2_0). The explanation of exactly how caching affects these timings is in the other answer.
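One way to see why the last axis is so much slower is to check which slices are contiguous in memory. A quick sketch (the flags don't depend on the array's size):

import numpy as np

a = np.zeros((10, 10, 10))

print(a[0, :, :].flags['C_CONTIGUOUS'])   # True: one solid block
print(a[:, :, 0].flags['C_CONTIGUOUS'])   # False: a strided view that
                                          # skips through the whole array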

answered by BlackBear