
Improving performance of operations on a NumPy array

Tags: python, numpy

I'm using numpy.delete to remove elements from an array inside a while loop that runs only while the array is non-empty. The code works fine but slows down considerably once the array has more than 1e6 elements. Here is an example:

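from numpy import where, delete

while array.shape[0] > 0:
    # indices of the elements in the closed interval [x, y]
    ix = where((array >= x) & (array <= y))[0]
    array = delete(array, ix, None)  # allocates a new array on every pass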

I've tried to make this code efficient, but I cannot find a good way to speed up the while loop. The bottleneck, I think, is the delete, which must involve a copy of some kind. I've tried using masked arrays to avoid the copying, but I'm not that good at Python and masked arrays are not easy to find information about. Is there a good, fast way to use delete, or to replace it, so that the loop above can handle 7e6 elements without taking 24 hours?

Thanks

Shejo284 asked May 14 '12 19:05




1 Answer

So you can substantially improve the performance of your code by:

  • eliminating the loop; and

  • avoiding the delete operations (which cause a copy of the original array)
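As a minimal sketch of those two points in released NumPy (plain boolean indexing, not the NA API discussed in the rest of this answer): one boolean mask plus one indexing operation drops every element in [x, y] in a single pass. The function name drop_range and the sample values are illustrative assumptions, not part of the question.

```python
import numpy as np

def drop_range(arr, x, y):
    """Return the elements of arr that lie outside the closed interval [x, y]."""
    keep = ~((arr >= x) & (arr <= y))  # True where the element survives
    return arr[keep]                   # a single copy, made once

arr = np.array([7, 5, 4, 8, 4, 2, 3, 1, 0, 9])
result = drop_range(arr, 3, 6)
print(result)  # [7 8 2 1 0 9]
```

Boolean indexing still copies the surviving elements, but it does so once per call instead of building an index array and calling delete on it.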

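The NumPy 1.7 development branch introduced a new NA mask that is far easier to use than the original masked arrays (numpy.ma); its performance is also much better because it is part of the core NumPy array object. I think it might be useful to you because it lets you avoid the expensive delete operation. Be aware that this NA support was removed before the final 1.7.0 release, so the snippets below run only on a development build from that period.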

In other words, instead of deleting the array elements you don't want, just mask them. This has been suggested in other answers, but I am suggesting that you use the new mask.
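In released NumPy, the same mask-instead-of-delete idea can be sketched with the older numpy.ma module. This is a swapped-in technique rather than the NA mask described here, and the sample array and the interval [3, 6] are assumptions chosen for illustration.

```python
import numpy as np

arr = np.array([7, 5, 4, 8, 4, 2, 3, 1, 0, 9])

# Mask (rather than delete) every element in the closed interval [3, 6].
masked = np.ma.masked_inside(arr, 3, 6)

remaining = masked.count()       # number of elements that are still unmasked
total = masked.sum()             # masked elements are skipped by reductions
survivors = masked.compressed()  # plain 1-D ndarray of the unmasked values

print(remaining, total, survivors)  # 6 27 [7 8 2 1 0 9]
```

The underlying data buffer keeps its original size; only the mask changes, so a loop condition such as "array not empty" becomes masked.count() > 0.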

To use NA, just import it:

>>> from numpy import NA as NA

then for a given array, set the maskna flag to True

>>> A.flags.maskna = True

Alternatively, most array constructors (as of the 1.7 development branch) take a maskna parameter, which you can set to True.

Once the flag is set, assign NA to any element:

>>> A[3,3] = NA
>>> A
array([[7, 5, 4, 8, 4],
       [2, 4, 3, 7, 3],
       [3, 1, 3, 2, 1],
       [8, 2, 0, NA, 7],
       [0, 7, 2, 5, 5],
       [5, 4, 2, 7, 4],
       [1, 2, 9, 2, 3],
       [7, 5, 1, 2, 9]])

>>> A.sum(axis=0)
array([33, 30, 24, NA, 36])

Often this is not what you want; i.e., you still want the sum of that column, with the NA treated as if it were 0.

To get that behavior, pass True for the skipna parameter (in the 1.7 development branch, reductions such as sum accept it):

>>> A.sum(axis=0, skipna=True)
array([33, 30, 24, 33, 36])
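The propagate-versus-skip behavior above has a rough analogue in released NumPy if NaN stands in for NA. This is only a sketch under that assumption: NaN forces a float array, and np.nansum plays the role of skipna=True. The array values are copied from the example above, with np.nan written where NA appears.

```python
import numpy as np

A = np.array([[7, 5, 4, 8, 4],
              [2, 4, 3, 7, 3],
              [3, 1, 3, 2, 1],
              [8, 2, 0, np.nan, 7],
              [0, 7, 2, 5, 5],
              [5, 4, 2, 7, 4],
              [1, 2, 9, 2, 3],
              [7, 5, 1, 2, 9]])  # float64, because NaN has no integer form

propagated = A.sum(axis=0)      # NaN propagates into column 3
skipped = np.nansum(A, axis=0)  # NaN is treated as 0

print(propagated)  # [33. 30. 24. nan 36.]
print(skipped)     # [33. 30. 24. 33. 36.]
```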

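In sum, to speed up your code, eliminate the loop and use the new mask. For a range like the one in your question, combine two comparisons, e.g. (A >= x) & (A <= y); the example here masks everything less than or equal to 3:

>>> A[A <= 3] = NA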

>>> A
array([[8, 8, 4, NA, NA],
       [7, 9, NA, NA, 8],
       [NA, 6, 9, 5, NA],
       [9, 4, 6, 6, 5],
       [NA, 6, 8, NA, NA],
       [8, 5, 7, 7, NA],
       [NA, 4, 5, 9, 9],
       [NA, 8, NA, 5, 9]])

The NA placeholders, in this context, behave like 0s, which I believe is what you want:

>>> A.sum(axis=0, skipna=True)
array([32, 50, 39, 32, 31])
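If the loop in the question really must run several times with different bounds, one possible sketch (an assumption about the use case, using plain boolean arrays in released NumPy rather than NA) keeps a persistent "alive" flag per element and flips flags in place instead of shrinking the array. The function name consume and the windows argument are hypothetical.

```python
import numpy as np

def consume(arr, windows):
    """Flag elements as removed, window by window, without reallocating arr.

    windows is a hypothetical iterable of (x, y) bounds, one pair per pass.
    Returns a boolean array that is True for elements never selected.
    """
    alive = np.ones(arr.shape[0], dtype=bool)
    for x, y in windows:
        if not alive.any():  # plays the role of array.shape[0] > 0
            break
        hit = (arr >= x) & (arr <= y) & alive
        # ... any per-pass work on arr[hit] would go here ...
        alive &= ~hit        # "delete" by clearing flags in place
    return alive

arr = np.array([7, 5, 4, 8, 4, 2, 3, 1, 0, 9])
alive = consume(arr, [(3, 6), (0, 2)])
left = arr[alive]
print(left)  # [7 8 9]
```

Each pass still scans the full array for the comparisons, but nothing is copied or resized, which is where the original loop spends its time.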

doug answered Nov 15 '22 20:11