Why is numpy.any so slow over large arrays?

I'm looking for the most efficient way to determine whether a large array contains at least one nonzero value. At first glance np.any seems like the obvious tool for the job, but over large arrays it turns out to be unexpectedly slow.

Consider this extreme case:

    first = np.zeros(1E3,dtype=np.bool)
    last = np.zeros(1E3,dtype=np.bool)

    first[0] = True
    last[-1] = True

    # test 1
    %timeit np.any(first)
    >>> 100000 loops, best of 3: 6.36 us per loop

    # test 2
    %timeit np.any(last)
    >>> 100000 loops, best of 3: 6.95 us per loop

At least np.any seems to be doing something vaguely sensible here - if the nonzero value is the first one in the array, there should be no need to consider any others before returning True, so I would expect test 1 to be slightly quicker than test 2.

However, what happens when we make the arrays much larger?

    first = np.zeros(1E9,dtype=np.bool)
    last = np.zeros(1E9,dtype=np.bool)

    first[0] = True
    last[-1] = True

    # test 3
    %timeit np.any(first)
    >>> 10 loops, best of 3: 21.6 ms per loop

    # test 4
    %timeit np.any(last)
    >>> 1 loops, best of 3: 739 ms per loop

As expected, test 4 is quite a lot slower than test 3. However, in test 3 np.any should still only have to check the value of a single element in first in order to know that it contains at least one nonzero value. Why, then, is test 3 so much slower than test 1?

Edit 1:

I'm using a development version of Numpy (1.8.0.dev-e11cd9b), but I get the exact same timing results using Numpy 1.7.1. I'm running 64 bit Linux, Python 2.7.4. My system is basically idling (I'm running an IPython session, a browser and a text editor), and I'm definitely not hitting the swap. I also replicated the result on another machine running Numpy 1.7.1.

Edit 2:

Using Numpy 1.6.2 I get times of ~1.85 us for both tests 1 and 3, so as jorgeca says there seems to have been some performance regression between Numpy 1.6.2 and 1.7.0 in this regard.

Edit 3:

Following J.F. Sebastian and jorgeca's lead I've done some more benchmarking using np.all on an array of zeros, which ought to be equivalent to calling np.any on an array where the first element is one.
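To make the equivalence explicit: in both cases the result is determined by the very first element, so an ideal short-circuiting implementation could return immediately. A minimal illustration:

    import numpy as np
    n = 1000
    first = np.zeros(n, dtype=np.bool)
    first[0] = True
    # both results are decided by element 0 alone
    print np.any(first)                       # True (first[0] is True)
    print np.all(np.zeros(n, dtype=np.bool))  # False (element 0 is False)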

Test script:

    import timeit
    import numpy as np

    print 'Numpy v%s' %np.version.full_version

    stmt = "np.all(x)"
    for ii in xrange(10):
        setup = "import numpy as np; x = np.zeros(%d,dtype=np.bool)" %(10**ii)
        timer = timeit.Timer(stmt,setup)
        n,r = 1,3
        t = np.min(timer.repeat(r,n))
        while t < 0.2:
            n *= 10
            t = np.min(timer.repeat(r,n))
        t /= n
        if t < 1E-3:
            timestr = "%1.3f us" %(t*1E6)
        elif t < 1:
            timestr = "%1.3f ms" %(t*1E3)
        else:
            timestr = "%1.3f s" %t
        print "Array size: 1E%i, %i loops, best of %i: %s/loop" %(ii,n,r,timestr)

Results:

    Numpy v1.6.2
    Array size: 1E0, 1000000 loops, best of 3: 1.738 us/loop
    Array size: 1E1, 1000000 loops, best of 3: 1.845 us/loop
    Array size: 1E2, 1000000 loops, best of 3: 1.862 us/loop
    Array size: 1E3, 1000000 loops, best of 3: 1.858 us/loop
    Array size: 1E4, 1000000 loops, best of 3: 1.864 us/loop
    Array size: 1E5, 1000000 loops, best of 3: 1.882 us/loop
    Array size: 1E6, 1000000 loops, best of 3: 1.866 us/loop
    Array size: 1E7, 1000000 loops, best of 3: 1.853 us/loop
    Array size: 1E8, 1000000 loops, best of 3: 1.860 us/loop
    Array size: 1E9, 1000000 loops, best of 3: 1.854 us/loop

    Numpy v1.7.0
    Array size: 1E0, 100000 loops, best of 3: 5.881 us/loop
    Array size: 1E1, 100000 loops, best of 3: 5.831 us/loop
    Array size: 1E2, 100000 loops, best of 3: 5.924 us/loop
    Array size: 1E3, 100000 loops, best of 3: 5.864 us/loop
    Array size: 1E4, 100000 loops, best of 3: 5.997 us/loop
    Array size: 1E5, 100000 loops, best of 3: 6.979 us/loop
    Array size: 1E6, 100000 loops, best of 3: 17.196 us/loop
    Array size: 1E7, 10000 loops, best of 3: 116.162 us/loop
    Array size: 1E8, 1000 loops, best of 3: 1.112 ms/loop
    Array size: 1E9, 100 loops, best of 3: 11.061 ms/loop

    Numpy v1.7.1
    Array size: 1E0, 100000 loops, best of 3: 6.216 us/loop
    Array size: 1E1, 100000 loops, best of 3: 6.257 us/loop
    Array size: 1E2, 100000 loops, best of 3: 6.318 us/loop
    Array size: 1E3, 100000 loops, best of 3: 6.247 us/loop
    Array size: 1E4, 100000 loops, best of 3: 6.492 us/loop
    Array size: 1E5, 100000 loops, best of 3: 7.406 us/loop
    Array size: 1E6, 100000 loops, best of 3: 17.426 us/loop
    Array size: 1E7, 10000 loops, best of 3: 115.946 us/loop
    Array size: 1E8, 1000 loops, best of 3: 1.102 ms/loop
    Array size: 1E9, 100 loops, best of 3: 10.987 ms/loop

    Numpy v1.8.0.dev-e11cd9b
    Array size: 1E0, 100000 loops, best of 3: 6.357 us/loop
    Array size: 1E1, 100000 loops, best of 3: 6.399 us/loop
    Array size: 1E2, 100000 loops, best of 3: 6.425 us/loop
    Array size: 1E3, 100000 loops, best of 3: 6.397 us/loop
    Array size: 1E4, 100000 loops, best of 3: 6.596 us/loop
    Array size: 1E5, 100000 loops, best of 3: 7.569 us/loop
    Array size: 1E6, 100000 loops, best of 3: 17.445 us/loop
    Array size: 1E7, 10000 loops, best of 3: 115.109 us/loop
    Array size: 1E8, 1000 loops, best of 3: 1.094 ms/loop
    Array size: 1E9, 100 loops, best of 3: 10.840 ms/loop

Edit 4:

Following seberg's comment I tried the same test with an np.float32 array instead of np.bool. In this case, Numpy 1.6.2 also shows a slowdown as array sizes increase.
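The only change needed to the test script from Edit 3 is the dtype in the setup string, i.e. something like:

    setup = "import numpy as np; x = np.zeros(%d,dtype=np.float32)" %(10**ii)

Results: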

    Numpy v1.6.2
    Array size: 1E0, 100000 loops, best of 3: 3.503 us/loop
    Array size: 1E1, 100000 loops, best of 3: 3.597 us/loop
    Array size: 1E2, 100000 loops, best of 3: 3.742 us/loop
    Array size: 1E3, 100000 loops, best of 3: 4.745 us/loop
    Array size: 1E4, 100000 loops, best of 3: 14.533 us/loop
    Array size: 1E5, 10000 loops, best of 3: 112.463 us/loop
    Array size: 1E6, 1000 loops, best of 3: 1.101 ms/loop
    Array size: 1E7, 100 loops, best of 3: 11.724 ms/loop
    Array size: 1E8, 10 loops, best of 3: 116.924 ms/loop
    Array size: 1E9, 1 loops, best of 3: 1.168 s/loop

    Numpy v1.7.1
    Array size: 1E0, 100000 loops, best of 3: 6.548 us/loop
    Array size: 1E1, 100000 loops, best of 3: 6.546 us/loop
    Array size: 1E2, 100000 loops, best of 3: 6.804 us/loop
    Array size: 1E3, 100000 loops, best of 3: 7.784 us/loop
    Array size: 1E4, 100000 loops, best of 3: 17.946 us/loop
    Array size: 1E5, 10000 loops, best of 3: 117.235 us/loop
    Array size: 1E6, 1000 loops, best of 3: 1.096 ms/loop
    Array size: 1E7, 100 loops, best of 3: 12.328 ms/loop
    Array size: 1E8, 10 loops, best of 3: 118.431 ms/loop
    Array size: 1E9, 1 loops, best of 3: 1.172 s/loop

Why should this happen? As with the boolean case, np.all should still only have to check the first element before returning, so times should still be constant w.r.t. array size.

asked Jun 15 '13 by ali_m



1 Answer

As has been guessed in the comments, I can confirm that the processing of the array is being done in chunks. First, I will show you where things are in the code and then I will show you how you can change the chunk size and the effect that doing so has on your benchmark.

Where to find the reduction processing in the Numpy source files

np.all(x) is the same as x.all(). all() really calls np.core.umath.logical_and.reduce(x).
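You can check that chain quickly at the interpreter:

    import numpy as np
    x = np.zeros(10, dtype=np.bool)
    # all three spellings perform the same logical_and reduction
    print np.all(x), x.all(), np.core.umath.logical_and.reduce(x)
    # -> False False False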

If you want to dig into the numpy source, I will try to guide you to where the buffer/chunk size comes into play. The folder with all of the code we will be looking at is numpy/core/src/umath/.

PyUFunc_Reduce() in ufunc_object.c is the C function that handles the reduce. In PyUFunc_Reduce(), the chunk, or buffer, size is found by looking up the value for reduce in some global dictionary via the PyUFunc_GetPyValues() function (ufunc_object.c). On my machine, compiling from the development branch, the chunk size is 8192. PyUFunc_ReduceWrapper() in reduction.c is called to set up the iterator (with a stride equal to the chunk size), and it calls the passed-in loop function, which is reduce_loop() in ufunc_object.c.

reduce_loop() basically just uses the iterator and calls another innerloop() function for each chunk. The innerloop function is found in loops.c.src. For a boolean array and our case of all/logical_and, the appropriate innerloop function is BOOL_logical_and. You can find the right function by searching for BOOLEAN LOOPS; it is the second function below that comment (it is easy to miss due to the template-like programming used here). There you will find that short-circuiting is in fact done within each chunk.
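To make the mechanism concrete, here is a rough pure-Python model of the buffered reduction described above. This is only a sketch consistent with the timings in the question, not the actual C code (the real implementation goes through the nditer machinery):

    import numpy as np

    def buffered_all(x, bufsize=8192):
        # toy model: the iterator hands the inner loop one buffer-sized
        # chunk at a time
        result = True
        for start in xrange(0, x.size, bufsize):
            chunk = x[start:start + bufsize]
            # the inner loop (BOOL_logical_and) can stop early *within*
            # this chunk...
            if not np.logical_and.reduce(chunk):
                result = False
            # ...but every chunk is still visited, so the total time grows
            # with the array size even when element 0 decides the answer
        return result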

How to change the buffer size used in ufuncs (and thus in any/all)

You can get the chunk/buffer size with np.getbufsize(). For me, that returns 8192 without manually setting it, which matches what I found by printing out the buffer size in the code. You can use np.setbufsize() to change the chunk size.
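For example (in the versions discussed here, np.setbufsize() returns the previous size, and the size must be a multiple of 16 and no larger than 10**7):

    import numpy as np

    print np.getbufsize()       # 8192 by default

    old = np.setbufsize(10**6)  # returns the previous buffer size
    # ... run the reduction you want to time with the bigger buffer ...
    np.setbufsize(old)          # restore the old value afterwards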

Results using a bigger buffer size

I changed your benchmark code to the following:

    import timeit
    import numpy as np

    print 'Numpy v%s' %np.version.full_version

    stmt = "np.all(x)"
    for ii in xrange(9):
        setup = "import numpy as np; x = np.zeros(%d,dtype=np.bool); np.setbufsize(%d)" %(10**ii, max(8192, min(10**ii, 10**7)))
        timer = timeit.Timer(stmt,setup)
        n,r = 1,3
        t = np.min(timer.repeat(r,n))
        while t < 0.2:
            n *= 10
            t = np.min(timer.repeat(r,n))
        t /= n
        if t < 1E-3:
            timestr = "%1.3f us" %(t*1E6)
        elif t < 1:
            timestr = "%1.3f ms" %(t*1E3)
        else:
            timestr = "%1.3f s" %t
        print "Array size: 1E%i, %i loops, best of %i: %s/loop" %(ii,n,r,timestr)

Numpy complains if the buffer size is too small or too big, so I made sure that it didn't get smaller than 8192 or larger than 1E7 (Numpy rejected a buffer size of 1E8). Within those limits, I set the buffer size to the size of the array being processed. I only went up to 1E8 elements because my machine only has 4GB of memory at the moment. Here are the results:

    Numpy v1.8.0.dev-2a5c2c8
    Array size: 1E0, 100000 loops, best of 3: 5.351 us/loop
    Array size: 1E1, 100000 loops, best of 3: 5.390 us/loop
    Array size: 1E2, 100000 loops, best of 3: 5.366 us/loop
    Array size: 1E3, 100000 loops, best of 3: 5.360 us/loop
    Array size: 1E4, 100000 loops, best of 3: 5.433 us/loop
    Array size: 1E5, 100000 loops, best of 3: 5.400 us/loop
    Array size: 1E6, 100000 loops, best of 3: 5.397 us/loop
    Array size: 1E7, 100000 loops, best of 3: 5.381 us/loop
    Array size: 1E8, 100000 loops, best of 3: 6.126 us/loop

There is a small uptick in the last timing because, with the buffer capped at 1E7 elements, the 1E8-element array is still processed in ten chunks rather than one.

answered by Justin Peel