Let's say I have a numpy array arr and a boolean array mask of the same shape (for example mask = arr >= 20).
I want an array containing all values of arr where mask is True. I don't really care about the order (I am going to take the sum of this afterwards).
From what I gather from the numpy docs, I can just use boolean indexing:
arr[mask]
Nevertheless, I have seen a lot of code on the internet along the lines of:
arr[np.where(mask)]
I think this does the same thing, but using index arrays.
Do these two lines really do the same thing? And if so, is one of them faster?
As for performance: why not simply measure it? Here is a simple example:
In [11]: y = np.arange(35).reshape(5,7)
In [12]: mask = (y % 2 == 0)
In [13]: mask
Out[13]:
array([[ True, False,  True, False,  True, False,  True],
       [False,  True, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False,  True],
       [False,  True, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False,  True]])
Then %timeit:
In [14]: %timeit y[mask]
534 ns ± 1.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [15]: %timeit y[np.where(mask)]
2.18 µs ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
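To repeat the measurement outside IPython, or on a larger array, something along these lines could be used. It is only a sketch: the array size, the random data and the repeat count are arbitrary choices made for illustration, and the threshold 20 is borrowed from the question. Absolute timings will differ between machines and NumPy versions.

```python
import timeit

import numpy as np

# Illustrative data: the shape and value range are assumptions, not from the post.
rng = np.random.default_rng(0)
arr = rng.integers(0, 100, size=(1000, 1000))
mask = arr >= 20

repeats = 20  # arbitrary; raise it for more stable numbers

# Time plain boolean indexing.
t_mask = timeit.timeit(lambda: arr[mask], number=repeats)
# Time indexing through the index arrays returned by np.where.
t_where = timeit.timeit(lambda: arr[np.where(mask)], number=repeats)

print(f"arr[mask]:           {t_mask / repeats * 1e3:.3f} ms per call")
print(f"arr[np.where(mask)]: {t_where / repeats * 1e3:.3f} ms per call")
```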
Unsurprisingly, the np.where variant is slower: even if there were no functional difference between the two lines, the extra function call adds overhead. As to "are they identical?": not exactly. From the np.where docstring:
where(condition, [x, y]): Return elements chosen from x or y depending on condition.
Note: When only condition is provided, this function is a shorthand for np.asarray(condition).nonzero(). Using nonzero directly should be preferred, as it behaves correctly for subclasses. The rest of this documentation covers only the case where all three arguments are provided.
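The equivalence stated in the docstring can be checked on the example array. This is a small sketch; the variable names idx_where, idx_nonzero and same_indices are made up for illustration.

```python
import numpy as np

y = np.arange(35).reshape(5, 7)
mask = (y % 2 == 0)

# Single-argument np.where returns a tuple with one index array per dimension.
idx_where = np.where(mask)
# The docstring describes this as the equivalent (and preferred) spelling.
idx_nonzero = np.asarray(mask).nonzero()

# Compare the two tuples dimension by dimension.
same_indices = all(
    np.array_equal(a, b) for a, b in zip(idx_where, idx_nonzero)
)
print(same_indices)
```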
Looking back at the example:
While y[mask] directly selects all matching (True) elements of y, np.where(mask) takes the detour of calculating all (here 2D) index positions for True elements in mask:
In [26]: np.where(mask)
Out[26]:
(array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4], dtype=int64),
 array([0, 2, 4, 6, 1, 3, 5, 0, 2, 4, 6, 1, 3, 5, 0, 2, 4, 6], dtype=int64))
In other words: using the boolean mask directly is not only simpler, but also avoids a lot of extra computation.
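For the use case in the question (selecting the values and then summing them), a quick check on the example array suggests that both spellings return the same values. The sketch below uses illustrative variable names. The last line shows a third option that skips the intermediate array; it assumes a NumPy version whose np.sum accepts the where argument (1.17 or later, as far as I know).

```python
import numpy as np

y = np.arange(35).reshape(5, 7)
mask = (y % 2 == 0)

direct = y[mask]               # boolean indexing
via_where = y[np.where(mask)]  # indexing with the index arrays from np.where

# Both select the same elements, so the sums agree as well.
same_values = np.array_equal(direct, via_where)
total_direct = direct.sum()
total_where = via_where.sum()

# Assumption: np.sum supports where= (NumPy >= 1.17); no temporary array needed.
total_masked = np.sum(y, where=mask)

print(same_values, total_direct, total_where, total_masked)
```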