boolean indexing vs np.where

Question

OK, let's say I have a numpy array arr and a boolean array mask of the same shape (for example, mask = arr >= 20).

I want an array containing all values of arr where mask is True. I don't really care about the order (I am going to take the sum of this afterwards).

From what I gather from the numpy docs, I can just use boolean indexing:

arr[mask]

Nevertheless, on the internet, I have seen a lot of code along the lines of:

arr[np.where(mask)]

Which, I think, does the same thing, but using index arrays.

Do these two lines really do the same thing? And if so, is one of them faster?

ojdo · Accepted Answer

As for performance: why not simply measure it? Take a simple example:

In [11]: y = np.arange(35).reshape(5,7)

In [12]: mask = (y % 2 == 0)

In [13]: mask
Out[13]:
array([[ True, False,  True, False,  True, False,  True],
       [False,  True, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False,  True],
       [False,  True, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False,  True]])

Then %timeit:

In [14]: %timeit y[mask]
534 ns ± 1.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [15]: %timeit y[np.where(mask)]
2.18 µs ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Unsurprisingly, np.where is slower: even if there were no functional difference between the two lines, the extra function call adds overhead. As for "are they identical?": not exactly. They select the same elements, but by different routes. From the np.where docstring:

where(condition, [x, y]): Return elements chosen from x or y depending on condition.

Note: When only condition is provided, this function is a shorthand for np.asarray(condition).nonzero(). Using nonzero directly should be preferred, as it behaves correctly for subclasses. The rest of this documentation covers only the case where all three arguments are provided.
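To make the docstring note concrete, here is a small sketch that reuses y and mask from the example above and compares the three spellings. The variable names by_mask, by_where and by_nonzero are only illustrative, not something from the numpy docs.

```python
import numpy as np

# Same example data as above
y = np.arange(35).reshape(5, 7)
mask = (y % 2 == 0)

# Direct boolean indexing
by_mask = y[mask]

# Index arrays via the one-argument form of np.where
by_where = y[np.where(mask)]

# The equivalent that the docstring recommends instead of np.where(mask)
by_nonzero = y[np.asarray(mask).nonzero()]

# All three select the same elements, in the same (row-major) order
same_values = (np.array_equal(by_mask, by_where)
               and np.array_equal(by_mask, by_nonzero))
print(same_values)

# The sum the question is ultimately after
print(by_mask.sum())
```

So for a boolean mask of the same shape as the array, the selected values (and hence their sum) should not depend on which form is used.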

Looking back at the example:

While y[mask] directly selects all matching (True) elements of y, np.where(mask) takes the detour of calculating all (here 2D) index positions for True elements in mask:

In [26]: np.where(mask)
Out[26]:
(array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4], dtype=int64),
 array([0, 2, 4, 6, 1, 3, 5, 0, 2, 4, 6, 1, 3, 5, 0, 2, 4, 6], dtype=int64))

In other words: using the boolean mask directly is not only simpler, but avoids a lot of extra computation.
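If you want to repeat the measurement outside IPython, or on an array closer to the one in the question, a rough sketch with the standard timeit module could look like this. The array size, the seed and the repeat count are arbitrary assumptions, and the absolute numbers will vary with machine and numpy version.

```python
import timeit

import numpy as np

# Hypothetical test data: a larger array and the mask from the question
rng = np.random.default_rng(0)
arr = rng.integers(0, 100, size=(1000, 1000))
mask = arr >= 20

# Time both spellings over the same number of calls
n_calls = 20
t_mask = timeit.timeit(lambda: arr[mask], number=n_calls)
t_where = timeit.timeit(lambda: arr[np.where(mask)], number=n_calls)

print(f"arr[mask]:           {t_mask / n_calls * 1e3:.3f} ms per call")
print(f"arr[np.where(mask)]: {t_where / n_calls * 1e3:.3f} ms per call")
```

No timings are shown here on purpose; run it on your own data before drawing conclusions.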

Tags: python, arrays, indexing, numpy

tbrugere
