Related to this question, I came across an indexing behaviour via Boolean arrays and broadcasting I do not understand. We know it's possible to index a NumPy array in 2 dimensions using integer indices and broadcasting. This is specified in the docs:
a = np.array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
b1 = np.array([False, True, True])
b2 = np.array([True, False, True, False])
c1 = np.where(b1)[0] # i.e. [1, 2]
c2 = np.where(b2)[0] # i.e. [0, 2]
a[c1[:, np.newaxis], c2] # or a[c1[:, None], c2]
array([[ 4, 6],
[ 8, 10]])
However, the same does not work for Boolean arrays.
a[b1[:, None], b2]
IndexError: too many indices for array
The alternative numpy.ix_
works for both integer and Boolean arrays. This seems to be because ix_
performs specific manipulation for Boolean arrays to ensure consistent treatment.
assert np.array_equal(a[np.ix_(b1, b2)], a[np.ix_(c1, c2)])
array([[ 4, 6],
[ 8, 10]])
So my question is: why does broadcasting work with integers, but not with Boolean arrays? Is this behaviour documented? Or am I misunderstanding a more fundamental issue?
We can also index NumPy arrays using a NumPy array of boolean values on one axis to specify the indices that we want to access. This will create a NumPy array of size 3x4 (3 rows and 4 columns) with values from 0 to 11 (value 12 not included).
You can index specific values from a NumPy array using another NumPy array of Boolean values on one axis to specify the indices you want to access. For example, to access the second and third values of array a = np. array([4, 6, 8]) , you can use the expression a[np.
Array indexing is the same as accessing an array element. You can access an array element by referring to its index number. The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the second has index 1 etc.
As @Divakar noted in comments, Boolean advanced indices behave as if they were first fed through np.nonzero
and then broadcast together, see the relevant documentation for extensive explanations. To quote the docs,
In general if an index includes a Boolean array, the result will be identical to inserting
obj.nonzero()
into the same position and using the integer array indexing mechanism described above.x[ind_1, boolean_array, ind_2]
is equivalent tox[(ind_1,) + boolean_array.nonzero() + (ind_2,)]
.
[...]
Combining multiple Boolean indexing arrays or a Boolean with an integer indexing array can best be understood with theobj.nonzero()
analogy. The functionix_
also supports boolean arrays and will work without any surprises.
In your case broadcasting would not necessarily be a problem, since both arrays have only two nonzero elements. The problem is the number of dimensions in the result:
>>> len(b1[:,None].nonzero())
2
>>> len(b2.nonzero())
1
Consequently the indexing expression a[b1[:,None], b2]
would be equivalent to a[b1[:,None].nonzero() + b2.nonzero()]
, which would put a length-3 tuple inside a
, corresponding to a 3d array index. Hence the error you see about "too many indices".
The surprises mentioned in the docs are very close to your example: what if you hadn't injected that singleton dimension? Starting from a length-3 and a length-4 Boolean array you would've ended up with a length-2 advanced index, i.e. a 1d array of size (2,)
. This is never what you'd want, which is leads us to another piece of trivia in the subject.
There's been a lot of discussion in planning to revamp advanced indexing, see the work-in-progress draft NEP 21. The gist of the issue is that fancy indexing in numpy, while clearly documented, has some very quirky features which aren't practically useful for anything, but which can bite you if you make a mistake by producing surprising results rather than errors.
A relevant quote from the NEP:
Mixed cases involving multiple array indices are also surprising, and only less problematic because the current behavior is so useless that it is rarely encountered in practice. When a boolean array index is mixed with another boolean or integer array, boolean array is converted to integer array indices (equivalent to
np.nonzero()
) and then broadcast. For example, indexing a 2D array of size(2, 2)
likex[[True, False], [True, False]]
produces a 1D vector with shape(1,)
, not a 2D sub-matrix with shape(1, 1)
.
Now, I emphasize that the NEP is very much work-in-progress, but one of the suggestions in the current state of the NEP is to forbid Boolean arrays in advanced indexing cases such as the above, and only allow them in "outer indexing" scenarios, i.e. exactly what np.ix_
would help you do with your Boolean array:
Boolean indexing is conceptionally outer indexing. Broadcasting together with other advanced indices in the manner of legacy indexing [i.e. the current behaviour] is generally not helpful or well defined. A user who wishes the "nonzero" plus broadcast behaviour can thus be expected to do this manually.
My point is that the behaviour of Boolean advanced indices and their deprecation status (or lack thereof) may change in the not-so-distant future.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With