 

How can one efficiently remove a range of rows from a large numpy array?

Tags:

python

numpy

Given a large 2d numpy array, I would like to remove a range of rows, say rows 10000:10010 efficiently. I have to do this multiple times with different ranges, so I would like to also make it parallelizable.

Using something like numpy.delete() is not efficient, since it needs to copy the array, taking too much time and memory. Ideally I would want to do something like create a view, but I am not sure how I could do this in this case. A masked array is also not an option since the downstream operations are not supported on masked arrays.

Any ideas?

asked Nov 01 '13 by Bitwise

People also ask

How do I delete multiple rows in NumPy?

We can delete multiple rows from a numpy array in a single numpy.delete() call by passing a slice. The first argument is the array, the second is the slice of indices to remove, and the third is the axis: axis=0 removes rows, axis=1 removes columns.
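A minimal sketch of slice-based deletion (the array here is illustrative); np.s_ builds the slice object that numpy.delete() accepts:

```python
import numpy as np

a = np.arange(20).reshape(5, 4)          # 5 rows, 4 columns
trimmed = np.delete(a, np.s_[1:3], axis=0)  # drop rows 1 and 2
# trimmed keeps rows 0, 3, and 4, so its shape is (3, 4)
```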

How do I delete rows in NumPy array based on condition?

np.delete(ndarray, index, axis) deletes rows or columns from a NumPy array. The parameter ndarray is the array to operate on, index gives the rows to delete (typically computed from a condition), and axis=0 selects rows, which is what we want here.
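For example, condition-based indices can be computed with np.where and then handed to numpy.delete() (the array and condition below are illustrative):

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
idx = np.where(a[:, 0] > 2)[0]        # rows whose first column exceeds 2
result = np.delete(a, idx, axis=0)    # only [[1, 2]] survives
```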

How do I remove multiple elements from a NumPy array?

One way to remove multiple elements from a NumPy array is to call numpy.delete() repeatedly, once per index, but it is far more efficient to pass all the indices to a single numpy.delete() call, since each call copies the array.
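A quick sketch of the single-call form (the array and indices are illustrative):

```python
import numpy as np

a = np.arange(10)
b = np.delete(a, [0, 3, 5])  # remove three elements in one copy
# b is [1, 2, 4, 6, 7, 8, 9]
```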


2 Answers

Because of the strided data structure that defines a numpy array, what you want will not be possible without copying. Your best option might be to use a masked array (or your own boolean array) to mark the deleted rows, and then do a single real delete operation of all the marked rows just before passing the array downstream.
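A sketch of this deferred-deletion idea, assuming the row ranges arrive as separate requests (the array and ranges here are illustrative):

```python
import numpy as np

a = np.arange(40).reshape(10, 4)

rows_to_drop = []
rows_to_drop.extend(range(2, 4))   # first delete request: rows 2-3
rows_to_drop.extend(range(7, 9))   # second delete request: rows 7-8

# One real copy at the end instead of one per request
result = np.delete(a, rows_to_drop, axis=0)  # 10 - 4 = 6 rows remain
```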

answered Oct 20 '22 by Warren Weckesser


There isn't really a good way to speed up the delete operation itself: as you've already alluded to, this kind of deletion requires the data to be copied in memory. The one thing you can do, as suggested by @WarrenWeckesser, is combine multiple delete operations and apply them all at once. Here's an example:

import numpy as np

ranges = [(10, 20), (25, 30), (50, 100)]
mask = np.ones(array.shape[0], dtype=bool)

# Update the mask with all the rows you want to delete
for start, end in ranges:
    mask[start:end] = False

# Apply all the changes at once
new_array = array[mask]

It doesn't really make sense to parallelize this: you're just copying data in memory, so the operation is memory bound anyway, and adding more CPUs will not help.

answered Oct 20 '22 by Bi Rico