This is based on a question asked in October 2018.
Consider the following code. Three simple functions to count non-zero elements in a NumPy 3D array (1000 × 1000 × 1000).
import numpy as np

def f_1(arr):
    return np.sum(arr > 0)

def f_2(arr):
    ans = 0
    for val in range(arr.shape[0]):
        ans += np.sum(arr[val, :, :] > 0)
    return ans

def f_3(arr):
    return np.count_nonzero(arr)

if __name__ == '__main__':
    data = np.random.randint(0, 10, (1_000, 1_000, 1_000))
    print(f_1(data))
    print(f_2(data))
    print(f_3(data))
Runtimes on my machine (Python 3.7.?, Windows 10, NumPy 1.16.?):
%timeit f_1(data)
1.73 s ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2(data)
1.4 s ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_3(data)
2.38 s ± 956 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, f_2() runs faster than f_1() and f_3(). However, this is not the case for data of smaller size. The question is: why? Is it NumPy, Python, or something else?
This is due to memory access and caching. Each of these functions is doing two things; take the first one as an example:
np.sum(arr > 0)
It first does a comparison to find where arr is greater than zero (or non-zero, since arr contains non-negative integers). This creates an intermediate array of the same shape as arr. Then, it sums this array.
Straightforward, right? Well, with np.sum(arr > 0) that intermediate array is large. When it is too large to fit in cache, performance suffers: by the time the processor starts executing the sum, most of the array elements will have been evicted from the cache and will need to be reloaded from memory.
Since f_2 iterates over the first dimension, it deals with smaller sub-arrays. The same copy and sum is done, but this time the intermediate array fits in cache. It is created, used, and destroyed without ever leaving the cache. This is much faster.
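To put rough numbers on that (a quick sketch of my own, not from the original post, assuming the same (1_000, 1_000, 1_000) integer array): the boolean intermediate built by f_1 is about 1 GB, while the per-slice intermediate built by f_2 is about 1 MB, small enough to sit comfortably in a typical CPU cache.

# Rough sizes of the intermediate boolean arrays (illustrative sketch;
# building the array needs a few GB of RAM, as in the original snippet).
import numpy as np

data = np.random.randint(0, 10, (1_000, 1_000, 1_000))

full_mask = data > 0             # what f_1 materializes
slice_mask = data[0, :, :] > 0   # what f_2 materializes per iteration

print(full_mask.nbytes)   # 1_000_000_000 bytes (~1 GB)
print(slice_mask.nbytes)  # 1_000_000 bytes (~1 MB)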
Now, you would think that f_3 would be the fastest (it uses a built-in method, after all), but looking at the source code shows that it uses the following operations:
a_bool = a.astype(np.bool_, copy=False)
return a_bool.sum(axis=axis, dtype=np.intp)
a_bool is just another way of finding the non-zero entries, and it creates a large intermediate array.
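As a quick sanity check (my own sketch, not from the original answer, and based on the NumPy 1.16-era source quoted above; newer versions may take a different path), the astype-then-sum route is essentially what f_1 does by hand, which is why the two end up with similar timings:

# Illustrative check that the quoted count_nonzero path, f_1's comparison,
# and the built-in all agree; a smaller array keeps the check quick.
import numpy as np

a = np.random.randint(0, 10, (100, 100, 100))

via_astype = a.astype(np.bool_, copy=False).sum(dtype=np.intp)  # quoted path
via_compare = np.sum(a > 0)                                     # f_1's approach
via_builtin = np.count_nonzero(a)                               # f_3's call

assert via_astype == via_compare == via_builtin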
Conclusions
Rules of thumb are just that, and are frequently wrong. If you want faster code, profile it and see what the problems are (good work on that here).
Python does some things very well. In cases where it's optimized, it can be faster than numpy. Don't be afraid to use plain old Python code or datatypes in combination with NumPy.
If you frequently find yourself manually writing for loops for better performance, you may want to take a look at numexpr - it does some of this automatically. I haven't used it much myself, but it should provide a good speedup if intermediate arrays are what's slowing down your program.
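For reference, a minimal numexpr sketch (my own example, not from the original answer). numexpr evaluates the expression in cache-sized blocks, so the full boolean intermediate is never materialized; where(...) is used here to turn the comparison into 0/1 integers for the sum.

# Minimal numexpr sketch (assumes numexpr is installed).
# numexpr evaluates expressions in cache-sized blocks, avoiding the
# large boolean intermediate that f_1 creates.
import numexpr as ne
import numpy as np

data = np.random.randint(0, 10, (1_000, 1_000, 1_000))

# where(...) maps the comparison to 0/1 integers that sum() reduces.
count = ne.evaluate("sum(where(data > 0, 1, 0))")
print(int(count))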
It's all a matter of how the data is laid out in memory and how the code accesses it. Essentially, data is fetched from memory in blocks which are then cached; if an algorithm manages to use data from a block that is already in the cache, there is no need to read from memory again. This can result in huge time savings, especially when the cache is much smaller than the data you are dealing with.
Consider these variations, which only differ in which axis we are iterating on:
def f_2_0(arr):
    ans = 0
    for val in range(arr.shape[0]):
        ans += np.sum(arr[val, :, :] > 0)
    return ans

def f_2_1(arr):
    ans = 0
    for val in range(arr.shape[1]):
        ans += np.sum(arr[:, val, :] > 0)
    return ans

def f_2_2(arr):
    ans = 0
    for val in range(arr.shape[2]):
        ans += np.sum(arr[:, :, val] > 0)
    return ans
And the results on my laptop:
%timeit f_1(data)
2.31 s ± 47.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_0(data)
1.88 s ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_1(data)
2.65 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_2(data)
12.8 s ± 650 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can see that f_2_1 is almost as fast as f_1, which makes me think that numpy is not using the optimal access pattern (the one used by f_2_0). The explanation for how exactly caching affects the timing is in the other answer.
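To make the access-pattern point concrete (my own illustration, not part of the original answer): NumPy arrays are C-ordered by default, so the last axis is contiguous in memory. Slicing on axis 0 therefore reads one contiguous block per iteration, while slicing on axis 2 touches elements that are far apart, which is why f_2_2 is so much slower. The strides show this directly:

# Strides of a C-ordered 3D array (uint8 used so each element is 1 byte;
# the shape matches the arrays above).
import numpy as np

arr = np.zeros((1_000, 1_000, 1_000), dtype=np.uint8)

print(arr.strides)
# (1000000, 1000, 1): arr[val, :, :] is one contiguous ~1 MB block,
# while arr[:, :, val] picks bytes that are 1000 apart in memory.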