This is based on a question asked in October 2018.
Consider the following code. Three simple functions to count non-zero elements in a NumPy 3D array (1000 × 1000 × 1000).
import numpy as np

def f_1(arr):
    return np.sum(arr > 0)

def f_2(arr):
    ans = 0
    for val in range(arr.shape[0]):
        ans += np.sum(arr[val, :, :] > 0)
    return ans

def f_3(arr):
    return np.count_nonzero(arr)

if __name__ == '__main__':
    data = np.random.randint(0, 10, (1_000, 1_000, 1_000))
    print(f_1(data))
    print(f_2(data))
    print(f_3(data))
Runtimes on my machine (Python 3.7.?, Windows 10, NumPy 1.16.?):
%timeit f_1(data)
1.73 s ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2(data)
1.4 s ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_3(data)
2.38 s ± 956 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, f_2() runs faster than f_1() and f_3(). However, this is not the case for data of smaller size. The question is: why? Is it NumPy, Python, or something else?
This is due to memory access and caching. Each of these functions is doing two things; take the first one as an example:
np.sum(arr > 0)
It first does a comparison to find where arr is greater than zero (or non-zero, since arr contains non-negative integers). This creates an intermediate array of the same shape as arr. Then, it sums this array.
Straightforward, right? Well, with np.sum(arr > 0) that intermediate array is large. When it is too large to fit in cache, performance suffers: by the time the processor starts executing the sum, most of the array elements will have been evicted from the cache and will need to be reloaded from memory.
Since f_2 iterates over the first dimension, it deals with smaller sub-arrays. The same copy and sum is done, but this time the intermediate array fits in cache. It is created, used, and destroyed without ever leaving the cache. This is much faster.
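To put rough numbers on that (a quick sketch of my own, not from the original post, assuming the same (1_000, 1_000, 1_000) integer array): the boolean intermediate built by f_1 is about 1 GB, while the per-slice intermediate built by f_2 is about 1 MB, small enough to sit comfortably in a typical CPU cache.

# Rough sizes of the intermediate boolean arrays (illustrative sketch;
# building the array needs a few GB of RAM, as in the original snippet).
import numpy as np

data = np.random.randint(0, 10, (1_000, 1_000, 1_000))

full_mask = data > 0             # what f_1 materializes
slice_mask = data[0, :, :] > 0   # what f_2 materializes per iteration

print(full_mask.nbytes)   # 1_000_000_000 bytes (~1 GB)
print(slice_mask.nbytes)  # 1_000_000 bytes (~1 MB)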
Now, you would think that f_3 would be the fastest (it uses a built-in method, after all), but looking at the source code shows that it uses the following operations:
a_bool = a.astype(np.bool_, copy=False)
return a_bool.sum(axis=axis, dtype=np.intp)
a_bool is just another way of finding the non-zero entries, and it creates a large intermediate array.
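As a quick sanity check (my own sketch, not from the original answer, and based on the NumPy 1.16-era source quoted above; newer versions may take a different path), the astype-then-sum route is essentially what f_1 does by hand, which is why the two end up with similar timings:

# Illustrative check that the quoted count_nonzero path, f_1's comparison,
# and the built-in all agree; a smaller array keeps the check quick.
import numpy as np

a = np.random.randint(0, 10, (100, 100, 100))

via_astype = a.astype(np.bool_, copy=False).sum(dtype=np.intp)  # quoted path
via_compare = np.sum(a > 0)                                     # f_1's approach
via_builtin = np.count_nonzero(a)                               # f_3's call

assert via_astype == via_compare == via_builtin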
Conclusions
Rules of thumb are just that, and are frequently wrong. If you want faster code, profile it and see what the problems are (good work on that here).
Python does some things very well. In cases where it's optimized, it can be faster than numpy. Don't be afraid to use plain old Python code or datatypes in combination with NumPy.
If you frequently find yourself manually writing for loops for better performance, you may want to take a look at numexpr - it does some of this automatically. I haven't used it much myself, but it should provide a good speedup if intermediate arrays are what's slowing down your program.
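For reference, a minimal numexpr sketch (my own example, not from the original answer). numexpr evaluates the expression in cache-sized blocks, so the full boolean intermediate is never materialized; where(...) is used here to turn the comparison into 0/1 integers for the sum.

# Minimal numexpr sketch (assumes numexpr is installed).
# numexpr evaluates expressions in cache-sized blocks, avoiding the
# large boolean intermediate that f_1 creates.
import numexpr as ne
import numpy as np

data = np.random.randint(0, 10, (1_000, 1_000, 1_000))

# where(...) maps the comparison to 0/1 integers that sum() reduces.
count = ne.evaluate("sum(where(data > 0, 1, 0))")
print(int(count))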
It's all a matter of how the data is laid out in memory and how the code accesses it. Essentially, data is fetched from memory in blocks which are then cached; if an algorithm manages to use data from a block that is already in the cache, there is no need to read from memory again. This can result in huge time savings, especially when the cache is much smaller than the data you are dealing with.
Consider these variations, which only differ in which axis we are iterating on:
def f_2_0(arr):
    ans = 0
    for val in range(arr.shape[0]):
        ans += np.sum(arr[val, :, :] > 0)
    return ans

def f_2_1(arr):
    ans = 0
    for val in range(arr.shape[1]):
        ans += np.sum(arr[:, val, :] > 0)
    return ans

def f_2_2(arr):
    ans = 0
    for val in range(arr.shape[2]):
        ans += np.sum(arr[:, :, val] > 0)
    return ans
And the results on my laptop:
%timeit f_1(data)
2.31 s ± 47.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_0(data)
1.88 s ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_1(data)
2.65 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2_2(data)
12.8 s ± 650 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can see that f_2_1 is almost as fast as f_1, which makes me think that numpy is not using the optimal access pattern (the one used by f_2_0). The explanation for how exactly caching affects the timing is in the other answer.
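To make the access-pattern point concrete (my own illustration, not part of the original answer): NumPy arrays are C-ordered by default, so the last axis is contiguous in memory. Slicing on axis 0 therefore reads one contiguous block per iteration, while slicing on axis 2 touches elements that are far apart, which is why f_2_2 is so much slower. The strides show this directly:

# Strides of a C-ordered 3D array (uint8 used so each element is 1 byte;
# the shape matches the arrays above).
import numpy as np

arr = np.zeros((1_000, 1_000, 1_000), dtype=np.uint8)

print(arr.strides)
# (1000000, 1000, 1): arr[val, :, :] is one contiguous ~1 MB block,
# while arr[:, :, val] picks bytes that are 1000 apart in memory.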