I have recently started studying McKinney's Python for data analysis. This tripped me up in the book: <blockquote> Array slices are views on the original array. This means data is not copied and any modifications to the view will be reflected in the source array ... As NumPy has been designed with large data use case in mind, you could imagine performance and memory problems if NumPy insisted on copying data left to right. </blockquote> Fine. Seems like a sensible design choice. But two pages later it says: <blockquote> Selecting data from an array by boolean indexing always creates a copy of the data, even if the returned array is unchanged. </blockquote> Wait, what? Also, <blockquote> You can even mix and match boolean arrays with slices ... e.g. <code>data[names == 'Bob', 2:]</code> </blockquote> Now what would that return? A view on a copy of the data? And why is this behavior the way it is? Coming from R, I see boolean indexing and location based indexing equally frequently used techniques. If NumPy has been designed to avoid copying memory, what drives this design choice? Thanks.

A copy is needed since with an arbitrary boolean index you have no guarantees that the result can be represented as an ndarray. See: http://scipy-lectures.github.io/advanced/advanced_numpy/#indexing-scheme-strides Slices return views since <blockquote> Everything can be represented by changing only <code>shape</code>, <code>strides</code>, and possibly adjusting the <code>data</code> pointer! </blockquote>

numpy: copying value defaults on integer indexing vs boolean indexing

Tags:

python

arrays

numpy

I have recently started studying McKinney's Python for data analysis. This tripped me up in the book:

Array slices are views on the original array. This means data is not copied and any modifications to the view will be reflected in the source array ... As NumPy has been designed with large data use case in mind, you could imagine performance and memory problems if NumPy insisted on copying data left to right.

Fine. Seems like a sensible design choice. But two pages later it says:

Selecting data from an array by boolean indexing always creates a copy of the data, even if the returned array is unchanged.

Wait, what? Also,

You can even mix and match boolean arrays with slices ... e.g. data[names == 'Bob', 2:]

Now what would that return? A view on a copy of the data? And why is this behavior the way it is? Coming from R, I see boolean indexing and location based indexing equally frequently used techniques. If NumPy has been designed to avoid copying memory, what drives this design choice?

Thanks.

683

asked Apr 24 '14 10:04

asb

2 Answers

Let's assume a 1D array. The data in memory would look something like:

10 | 11 | 12 | 13 | 14 | 15 | 16

Accessing an element by index is trivial. Just take the position of the first element, and jump n steps. So, for arr[2]:

10 | 11 | 12 | 13 | 14 | 15 | 16
           ^

I can get the position in memory with just one multiplication. Fast and easy.

I can do a slice, and say "take only arr2 = arr[2:-1]":

10 | 11 | 12 | 13 | 14 | 15 | 16
           ^----^----^----^

Now, the memory layout is very similar. Getting an element is a multiplication from a new starting point. arr2[1]:

10 | 11 | 12 | 13 | 14 | 15 | 16
  (ignore) -----^----------

You can do a fancier trick, and say arr3 = arr[::2], take all the elements jumping one each.

10 | 11 | 12 | 13 | 14 | 15 | 16
 ^---------^---------^---------^

Again, getting indexes of arr3 is very simple: just do a multiplication, but now the size is bigger. This is what strides are for, they tell you the sizes of the blocks and how to get elements by indexing. Strides are even more powerful in more dimensions. This is, by the way, the way we can turn memory (1D) into a matrix (2D).

Now, we get to boolean arrays. If my mask is: T F T T F F T and I ask you for the third element, you would need to transvers the mask, find which is the third true, and then get its index; thus, very slow. So, when taking a boolean mask we have to make a copy of the data. There are some masks than can be represented with strides, but not in general, so for consistency, always a copy.

As a side note, sometimes, the cost of making a copy is worth performance-wise. If you want to do many operations reading "every fifth element of an array", the data in memory will not be aligned, so the CPU will have to wait for it to be fetched every time. It would then be faster to make a single copy (will be continuous), and work with it.

129

answered Sep 26 '22 01:09

Davidmh

A copy is needed since with an arbitrary boolean index you have no guarantees that the result can be represented as an ndarray.

See: http://scipy-lectures.github.io/advanced/advanced_numpy/#indexing-scheme-strides

Slices return views since

Everything can be represented by changing only shape, strides, and possibly adjusting the data pointer!

answered Sep 25 '22 01:09

YXD

Related questions
                            
                                Why Python does not release memory (under mod_wsgi + Django)
                            
                                pymongo: Advantage of using MongoReplicaSetClient?
                            
                                pandas increment cell value of dataframe with mixed column types
                            
                                Mutating request.base_url in Flask
                            
                                Multiple incompatible subclass instances of InteractiveShellEmbed are being created
                            
                                How to run a subprocess in the background python [duplicate]
                            
                                7-bits up to 0xEF
                            
                                Change the column width in the django admin panel
                            
                                How to validate image format in django ImageField
                            
                                Linux - Weird Python Output
                            
                                In Python 2, what are the performance trade-offs for viewvalues()/viewitems() over itervalues()/iteritems()?
                            
                                Should I use pygame.event.get() or pygame.event.poll()?
                            
                                Send text "http" over python socket
                            
                                Sympy: working with equalities manually
                            
                                "AttributeError: 'list' object has no attribute 'ravel'"
                            
                                Connect terminal IPython to existing notebook kernel
                            
                                get "1" for a one-dimensional numpy.array using a shape-like function
                            
                                Python Boto: How do you specify a subnet id AND a security group?
                            
                                How to do struct.pack and struct.unpack in cython?
                            
                                python utf-8-sig BOM in the middle of the file when appending to the end

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With