Given a large 2d numpy array, I would like to remove a range of rows, say rows 10000:10010
efficiently. I have to do this multiple times with different ranges, so I would like to also make it parallelizable.
Using something like numpy.delete()
is not efficient, since it needs to copy the array, taking too much time and memory. Ideally I would want to do something like create a view, but I am not sure how I could do this in this case. A masked array is also not an option since the downstream operations are not supported on masked arrays.
Any ideas?
Delete multiple rows using slice We can delete multiple rows and rows from the numpy array by using numpy. delete() function using slicing. Where s The first argument is numpy array and the second argument is slicing and third argument is the axis =1 mean,i will delete multiple columns.
np. delete(ndarray, index, axis): Delete items of rows or columns from the NumPy array based on given index conditions and axis specified, the parameter ndarray is the array on which the manipulation will happen, the index is the particular rows based on conditions to be deleted, axis=0 for removing rows in our case.
One way to remove multiple elements from a NumPy array is by calling the numpy. delete() function repeatedly for a bunch of indices.
Because of the strided data structure that defines a numpy array, what you want will not be possible without using a masked array. Your best option might be to use a masked array (or perhaps your own boolean array) to mask the deleted the rows, and then do a single real delete
operation of all the rows to be deleted before passing it downstream.
There isn't really a good way to speed up the delete operation, as you've already alluded to, this kind of deleting requires the data to be copied in memory. The one thing you can do, as suggested by @WarrenWeckesser, is combine multiple delete operations and apply them all at once. Here's an example:
ranges = [(10, 20), (25, 30), (50, 100)]
mask = np.ones(len(array), dtype=bool)
# Update the mask with all the rows you want to delete
for start, end in ranges:
mask[start:stop] = False
# Apply all the changes at once
new_array = array[mask]
It doesn't really make sense to parallelize this because you're just copying stuff in memory so this will be memory bound anyways, adding more cpus will not help.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With