Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

numpy: copying value defaults on integer indexing vs boolean indexing

I have recently started studying McKinney's Python for data analysis. This tripped me up in the book:

Array slices are views on the original array. This means data is not copied and any modifications to the view will be reflected in the source array ... As NumPy has been designed with large data use case in mind, you could imagine performance and memory problems if NumPy insisted on copying data left to right.

Fine. Seems like a sensible design choice. But two pages later it says:

Selecting data from an array by boolean indexing always creates a copy of the data, even if the returned array is unchanged.

Wait, what? Also,

You can even mix and match boolean arrays with slices ... e.g. data[names == 'Bob', 2:]

Now what would that return? A view on a copy of the data? And why is this behavior the way it is? Coming from R, I see boolean indexing and location based indexing equally frequently used techniques. If NumPy has been designed to avoid copying memory, what drives this design choice?

Thanks.

like image 683
asb Avatar asked Apr 24 '14 10:04

asb


People also ask

What is Boolean indexing in python?

Boolean indexing helps us to select the data from the DataFrames using a boolean vector. We need a DataFrame with a boolean index to use the boolean indexing.

What is fancy indexing in python?

Fancy indexing is conceptually simple: it means passing an array of indices to access multiple array elements at once. For example, consider the following array: import numpy as np rand = np. random. RandomState(42) x = rand.

What is integer array indexing?

Integer array indexing allows selection of arbitrary items in the array based on their N-dimensional index. Each integer array represents a number of indexes into that dimension.

What is negative indexing in numpy array?

Negative indices are interpreted as counting from the end of the array (i.e., if i < 0, it means n_i + i). All arrays generated by basic slicing are always views of the original array. The standard rules of sequence slicing apply to basic slicing on a per-dimension basis (including using a step index).


2 Answers

Let's assume a 1D array. The data in memory would look something like:

10 | 11 | 12 | 13 | 14 | 15 | 16

Accessing an element by index is trivial. Just take the position of the first element, and jump n steps. So, for arr[2]:

10 | 11 | 12 | 13 | 14 | 15 | 16
           ^

I can get the position in memory with just one multiplication. Fast and easy.

I can do a slice, and say "take only arr2 = arr[2:-1]":

10 | 11 | 12 | 13 | 14 | 15 | 16
           ^----^----^----^

Now, the memory layout is very similar. Getting an element is a multiplication from a new starting point. arr2[1]:

10 | 11 | 12 | 13 | 14 | 15 | 16
  (ignore) -----^----------

You can do a fancier trick, and say arr3 = arr[::2], take all the elements jumping one each.

10 | 11 | 12 | 13 | 14 | 15 | 16
 ^---------^---------^---------^

Again, getting indexes of arr3 is very simple: just do a multiplication, but now the size is bigger. This is what strides are for, they tell you the sizes of the blocks and how to get elements by indexing. Strides are even more powerful in more dimensions. This is, by the way, the way we can turn memory (1D) into a matrix (2D).

Now, we get to boolean arrays. If my mask is: T F T T F F T and I ask you for the third element, you would need to transvers the mask, find which is the third true, and then get its index; thus, very slow. So, when taking a boolean mask we have to make a copy of the data. There are some masks than can be represented with strides, but not in general, so for consistency, always a copy.

As a side note, sometimes, the cost of making a copy is worth performance-wise. If you want to do many operations reading "every fifth element of an array", the data in memory will not be aligned, so the CPU will have to wait for it to be fetched every time. It would then be faster to make a single copy (will be continuous), and work with it.

like image 129
Davidmh Avatar answered Sep 26 '22 01:09

Davidmh


A copy is needed since with an arbitrary boolean index you have no guarantees that the result can be represented as an ndarray.

See: http://scipy-lectures.github.io/advanced/advanced_numpy/#indexing-scheme-strides

Slices return views since

Everything can be represented by changing only shape, strides, and possibly adjusting the data pointer!

like image 33
YXD Avatar answered Sep 25 '22 01:09

YXD