Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

NumPy indexing: broadcasting with Boolean arrays

Related to this question, I came across an indexing behaviour via Boolean arrays and broadcasting I do not understand. We know it's possible to index a NumPy array in 2 dimensions using integer indices and broadcasting. This is specified in the docs:

a = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])

b1 = np.array([False, True, True])
b2 = np.array([True, False, True, False])

c1 = np.where(b1)[0]  # i.e. [1, 2]
c2 = np.where(b2)[0]  # i.e. [0, 2]

a[c1[:, np.newaxis], c2]  # or a[c1[:, None], c2]

array([[ 4,  6],
       [ 8, 10]])

However, the same does not work for Boolean arrays.

a[b1[:, None], b2]

IndexError: too many indices for array

The alternative numpy.ix_ works for both integer and Boolean arrays. This seems to be because ix_ performs specific manipulation for Boolean arrays to ensure consistent treatment.

assert np.array_equal(a[np.ix_(b1, b2)], a[np.ix_(c1, c2)])

array([[ 4,  6],
       [ 8, 10]])

So my question is: why does broadcasting work with integers, but not with Boolean arrays? Is this behaviour documented? Or am I misunderstanding a more fundamental issue?

like image 917
jpp Avatar asked Jul 04 '18 16:07

jpp


People also ask

Does NumPy arrays support boolean indexing?

We can also index NumPy arrays using a NumPy array of boolean values on one axis to specify the indices that we want to access. This will create a NumPy array of size 3x4 (3 rows and 4 columns) with values from 0 to 11 (value 12 not included).

How do you use boolean indexing in NumPy?

You can index specific values from a NumPy array using another NumPy array of Boolean values on one axis to specify the indices you want to access. For example, to access the second and third values of array a = np. array([4, 6, 8]) , you can use the expression a[np.

Can NumPy arrays be indexed?

Array indexing is the same as accessing an array element. You can access an array element by referring to its index number. The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the second has index 1 etc.


1 Answers

As @Divakar noted in comments, Boolean advanced indices behave as if they were first fed through np.nonzero and then broadcast together, see the relevant documentation for extensive explanations. To quote the docs,

In general if an index includes a Boolean array, the result will be identical to inserting obj.nonzero() into the same position and using the integer array indexing mechanism described above. x[ind_1, boolean_array, ind_2] is equivalent to x[(ind_1,) + boolean_array.nonzero() + (ind_2,)].
[...]
Combining multiple Boolean indexing arrays or a Boolean with an integer indexing array can best be understood with the obj.nonzero() analogy. The function ix_ also supports boolean arrays and will work without any surprises.

In your case broadcasting would not necessarily be a problem, since both arrays have only two nonzero elements. The problem is the number of dimensions in the result:

>>> len(b1[:,None].nonzero())
2
>>> len(b2.nonzero())
1

Consequently the indexing expression a[b1[:,None], b2] would be equivalent to a[b1[:,None].nonzero() + b2.nonzero()], which would put a length-3 tuple inside a, corresponding to a 3d array index. Hence the error you see about "too many indices".

The surprises mentioned in the docs are very close to your example: what if you hadn't injected that singleton dimension? Starting from a length-3 and a length-4 Boolean array you would've ended up with a length-2 advanced index, i.e. a 1d array of size (2,). This is never what you'd want, which is leads us to another piece of trivia in the subject.

There's been a lot of discussion in planning to revamp advanced indexing, see the work-in-progress draft NEP 21. The gist of the issue is that fancy indexing in numpy, while clearly documented, has some very quirky features which aren't practically useful for anything, but which can bite you if you make a mistake by producing surprising results rather than errors.

A relevant quote from the NEP:

Mixed cases involving multiple array indices are also surprising, and only less problematic because the current behavior is so useless that it is rarely encountered in practice. When a boolean array index is mixed with another boolean or integer array, boolean array is converted to integer array indices (equivalent to np.nonzero()) and then broadcast. For example, indexing a 2D array of size (2, 2) like x[[True, False], [True, False]] produces a 1D vector with shape (1,), not a 2D sub-matrix with shape (1, 1).

Now, I emphasize that the NEP is very much work-in-progress, but one of the suggestions in the current state of the NEP is to forbid Boolean arrays in advanced indexing cases such as the above, and only allow them in "outer indexing" scenarios, i.e. exactly what np.ix_ would help you do with your Boolean array:

Boolean indexing is conceptionally outer indexing. Broadcasting together with other advanced indices in the manner of legacy indexing [i.e. the current behaviour] is generally not helpful or well defined. A user who wishes the "nonzero" plus broadcast behaviour can thus be expected to do this manually.

My point is that the behaviour of Boolean advanced indices and their deprecation status (or lack thereof) may change in the not-so-distant future.

like image 157