
Explanation of boolean indexing behaviors

For the 2D array y:

import numpy as np
y = np.arange(20).reshape(5, 4)
---
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]

All of the following index expressions select the 1st, 3rd, and 5th rows. This is clear.

print(y[
    [0, 2, 4],
    ::
])
print(y[
    [True, False, True, False, True],
    ::
])
---
[[ 0  1  2  3]
 [ 8  9 10 11]
 [16 17 18 19]]

Questions

Please help me understand which rules or mechanisms produce the results below.

Replacing the list with a tuple produces an empty array with shape (0, 5, 4).

y[
    (True, False, True, False, True)
]
---
array([], shape=(0, 5, 4), dtype=int64)

Using a single True adds a new axis.

y[True]
---
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [12, 13, 14, 15],
        [16, 17, 18, 19]]])


y[True].shape
---
(1, 5, 4)

Adding an additional boolean True produces the same result.

y[True, True]
---
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [12, 13, 14, 15],
        [16, 17, 18, 19]]])

y[True, True].shape
---
(1, 5, 4)

However, adding a boolean False produces the empty array again.

y[True, False]
---
array([], shape=(0, 5, 4), dtype=int64)

I am not sure the documentation explains this behavior.

  • Boolean array indexing

In general if an index includes a Boolean array, the result will be identical to inserting obj.nonzero() into the same position and using the integer array indexing mechanism described above. x[ind_1, boolean_array, ind_2] is equivalent to x[(ind_1,) + boolean_array.nonzero() + (ind_2,)].

If there is only one Boolean array and no integer indexing array present, this is straight forward. Care must only be taken to make sure that the boolean index has exactly as many dimensions as it is supposed to work with.
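The equivalence quoted above can be checked directly. This is a minimal sketch, assuming the same 5x4 array y; the names mask and rows are hypothetical and not part of the original post.

```python
import numpy as np

y = np.arange(20).reshape(5, 4)

# Hypothetical 1-D boolean array selecting rows 0, 2 and 4.
mask = np.array([True, False, True, False, True])

# mask.nonzero() returns a tuple with one integer array per mask dimension.
rows = mask.nonzero()

# Boolean array indexing versus the equivalent integer array indexing.
by_mask = y[mask]
by_nonzero = y[rows]

# Boolean array followed by an integer index: x[boolean_array, ind_2]
# compared with x[boolean_array.nonzero() + (ind_2,)].
col_by_mask = y[mask, 1]
col_by_nonzero = y[rows + (1,)]

print(np.array_equal(by_mask, by_nonzero))
print(np.array_equal(col_by_mask, col_by_nonzero))
```

Note that this covers boolean arrays only. The scalar True and False cases shown earlier are not described by this passage.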

asked Jan 06 '21 04:01 by mon


1 Answer

Boolean scalar indexing is not well-documented, but you can trace how it is handled in the source code. See for example this comment and associated code in the numpy source:

/*
* This can actually be well defined. A new axis is added,
* but at the same time no axis is "used". So if we have True,
* we add a new axis (a bit like with np.newaxis). If it is
* False, we add a new axis, but this axis has 0 entries.
*/

So if an index is a scalar boolean, a new axis is added. If the value is True the size of the axis is 1, and if the value is False, the size of the axis is zero.
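As a rough illustration of that rule, the sketch below reuses the 5x4 array y from the question and compares a scalar True index with np.newaxis. The variable names are chosen here for illustration only.

```python
import numpy as np

y = np.arange(20).reshape(5, 4)

# A scalar True adds a leading axis of length 1, much like np.newaxis.
with_true = y[True]
with_newaxis = y[np.newaxis]

# A scalar False also adds a leading axis, but of length 0,
# so the result contains no elements.
with_false = y[False]

print(with_true.shape)
print(with_newaxis.shape)
print(with_false.shape, with_false.size)
```

The True case and np.newaxis give the same array here. The False case keeps the trailing (5, 4) dimensions but holds no data.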

This behavior was introduced in numpy#3798, and the author outlines the motivation in this comment; roughly, the aim was to provide consistency in the output of filtering operations. For example:

import numpy as np

x = np.ones((2, 2))
assert x[x > 0].ndim == 1

x = np.ones(2)
assert x[x > 0].ndim == 1

x = np.ones(())
assert x[x > 0].ndim == 1  # scalar boolean here!

The interesting thing is that any subsequent scalar booleans after the first do not add additional dimensions! From an implementation standpoint, this seems to be due to consecutive 0D boolean indices being treated as equivalent to consecutive fancy indices (i.e. HAS_0D_BOOL is treated as HAS_FANCY in some cases) and thus are combined in the same way as fancy indices. From a logical standpoint, this corner-case behavior does not appear to be intentional: for example, I can't find any discussion of it in numpy#3798.
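The corner case can be reproduced with a short sketch, again assuming the 5x4 array y from the question. It contrasts repeated scalar booleans with repeated np.newaxis, which is documented to add one axis per occurrence.

```python
import numpy as np

y = np.arange(20).reshape(5, 4)

# Repeated scalar booleans collapse into a single new leading axis.
shape_tt = y[True, True].shape
shape_ttt = y[True, True, True].shape

# A single False among them makes that one axis empty.
shape_tf = y[True, False].shape

# By contrast, each np.newaxis adds its own axis.
shape_nn = y[np.newaxis, np.newaxis].shape

print(shape_tt, shape_ttt, shape_tf, shape_nn)
```

If an explicit extra axis is the goal, np.newaxis (or None) expresses it without relying on the undocumented scalar boolean behavior.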

Given that, I would recommend treating this behavior as poorly defined and avoiding it in favor of well-documented indexing approaches.

answered Sep 27 '22 20:09 by jakevdp