numpy: boolean indexing and memory usage

Consider the following numpy code:

A[start:end] = B[mask]

Here:

  • A and B are 2D arrays with the same number of columns;
  • start and end are scalars;
  • mask is a 1D boolean array;
  • (end - start) == sum(mask).

In principle, the above operation can be carried out using O(1) temporary storage, by copying elements of B directly into A.

Is this what actually happens in practice, or does numpy construct a temporary array for B[mask]? If the latter, is there a way to avoid this by rewriting the statement?

NPE asked May 11 '11 09:05


2 Answers

The line

A[start:end] = B[mask]

will, according to the Python language definition, first evaluate the right-hand side, yielding a new array that contains the selected rows of B and occupies additional memory. The most efficient pure-Python way I'm aware of to avoid this is an explicit loop:

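from itertools import compress

# zip() is lazy in Python 3; on Python 2, use itertools.izip instead.
for i, b in zip(range(start, end), compress(B, mask)):
    A[i] = b  # copy one selected row of B at a time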

Of course, this will be much less time-efficient than your original code, but it uses only O(1) additional memory. Also note that itertools.compress() is only available in Python 2.7, or in 3.1 and above.
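A possible middle ground, not part of the original answer, is to copy in blocks: each block still creates a temporary, but its size is bounded by the block length rather than by sum(mask). The following is only a sketch; the function name masked_copy_chunked and the chunk parameter are made up for illustration.

```python
import numpy as np

def masked_copy_chunked(A, B, mask, start, chunk=1024):
    """Copy B[mask] into A[start:...] using temporaries of at most `chunk` rows.

    Hypothetical helper for illustration; returns the index one past the
    last row written to A.
    """
    pos = start
    for lo in range(0, len(mask), chunk):
        m = mask[lo:lo + chunk]              # basic slice: a view, no copy
        n = int(np.count_nonzero(m))
        if n:
            # B[lo:lo + chunk] is a view; indexing it with m makes a
            # temporary of n <= chunk rows, which is then copied into A.
            A[pos:pos + n] = B[lo:lo + chunk][m]
            pos += n
    return pos

# Example data (assumed shapes, chosen only for the demonstration).
rng = np.random.default_rng(0)
B = rng.random((10_000, 4))
mask = rng.random(10_000) < 0.3
start = 5
end = start + int(mask.sum())
A = np.zeros((end + 5, 4))

stop = masked_copy_chunked(A, B, mask, start, chunk=512)
```

With a chunk of a few hundred to a few thousand rows, most of the work should stay inside numpy while the temporary storage stays small and fixed. The best chunk size depends on the row width and is worth measuring.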

Sven Marnach answered Sep 29 '22 12:09


Using a boolean array as an index is fancy indexing, so numpy needs to make a copy. If you are running into memory problems, you could write a Cython extension to deal with it.
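If you want to check the view-versus-copy behaviour yourself, np.shares_memory can tell the two apart. This is a small sketch with made-up example arrays:

```python
import numpy as np

B = np.arange(12).reshape(4, 3)
mask = np.array([True, False, True, False])

view = B[0:2]        # basic slicing: a view onto B's buffer
selected = B[mask]   # boolean (fancy) indexing: a newly allocated array

shares_view = np.shares_memory(B, view)          # view shares B's memory
shares_selected = np.shares_memory(B, selected)  # the copy does not

selected[0, 0] = -1  # modifying the copy leaves B untouched
```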

tillsten answered Sep 29 '22 10:09