I have a 2-D numpy array with 100,000+ rows. I need to return a subset of those rows (and I need to perform that operation many thousands of times, so efficiency is important).
A mock-up example is like this:
import numpy as np
a = np.array([[1,5.5],
[2,4.5],
[3,9.0],
[4,8.01]])
b = np.array([2,4])
So...I want to return the array from a with rows identified in the first column by b:
c=[[2,4.5],
[4,8.01]]
The difference, of course, is that there are many more rows in both a and b, so I'd like to avoid looping. Also, I played with making a dictionary and using np.nonzero but still am a bit stumped.
Thanks in advance for any ideas!
EDIT: Note that, in this case, b are identifiers rather than indices. Here's a revised example:
import numpy as np
a = np.array([[102,5.5],
[204,4.5],
[343,9.0],
[40,8.01]])
b = np.array([102,343])
And I want to return:
c = [[102,5.5],
[343,9.0]]
EDIT: Deleted my original answer since it was a misunderstanding of the question. Instead try:
ii = np.where((a[:,0] - b.reshape(-1,1)) == 0)[1]
c = a[ii,:]
What I'm doing is using broadcasting to subtract each element of b from the first column of a, and then searching for zeros in that array, which indicate a match. This should work, but you should be a little careful when comparing floats, especially if b is not an array of ints.
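To make the broadcasting step concrete, here is a small sketch using the revised example from the question. The intermediate name `diff` is introduced here only for illustration; the shapes noted in the comments are the point.

```python
import numpy as np

a = np.array([[102, 5.5],
              [204, 4.5],
              [343, 9.0],
              [40, 8.01]])
b = np.array([102, 343])

# b.reshape(-1, 1) has shape (2, 1) and a[:, 0] has shape (4,).
# Broadcasting yields a (2, 4) array: one row per element of b,
# one column per row of a.
diff = a[:, 0] - b.reshape(-1, 1)

# np.where returns (row_indices, column_indices) of the zeros.
# The column indices are the positions of the matching rows in a.
ii = np.where(diff == 0)[1]
c = a[ii, :]
print(c)
```

Note that the intermediate array has len(b) * len(a) elements, so memory use grows quickly when both arrays are large.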
EDIT 2 Thanks to Sven's suggestion, you can try this slightly modified version instead:
ii = np.where(a[:,0] == b.reshape(-1,1))[1]
c = a[ii,:]
It's a bit faster than my original implementation.
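As an aside that is not part of the approach above: if the order of the returned rows does not need to follow b, a boolean mask built with np.isin avoids materialising the full len(b) by len(a) comparison array. This is only a sketch of an alternative, and the name `mask` is chosen here for illustration.

```python
import numpy as np

a = np.array([[102, 5.5],
              [204, 4.5],
              [343, 9.0],
              [40, 8.01]])
b = np.array([343, 102])  # deliberately not in the same order as a

# True for every row of a whose first-column value occurs in b.
mask = np.isin(a[:, 0], b)

# Rows come back in the order they appear in a, not in the order of b.
c = a[mask]
print(c)
```

The trade-off is that identifiers in b that are missing from a are silently ignored, and duplicates in b do not produce duplicate rows.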
EDIT 3 The fastest solution by far (~10x faster than Sven's second solution for large arrays) is:
c = a[np.searchsorted(a[:,0],b),:]
This assumes that a[:,0] is sorted and that all values of b appear in a[:,0].
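If those two assumptions do not hold, the same searchsorted idea can still be used with a little extra bookkeeping. The sketch below is one possible way to do it, not the original answer's code: the helper name `rows_by_id` is hypothetical, the first column is sorted indirectly through the `sorter` argument, and identifiers missing from a are dropped rather than raising an error.

```python
import numpy as np

def rows_by_id(a, b):
    """Return the rows of a whose first column matches the values in b.

    Hypothetical helper: works when a[:, 0] is unsorted and skips
    values of b that do not occur in a[:, 0].
    """
    keys = a[:, 0]
    order = np.argsort(keys)                      # indices that would sort keys
    pos = np.searchsorted(keys, b, sorter=order)  # positions in the sorted view

    # searchsorted returns an insertion point even for missing values,
    # which can be len(keys); clip so the lookup below stays in range.
    pos = np.clip(pos, 0, len(keys) - 1)
    ii = order[pos]                               # map back to row numbers in a

    found = keys[ii] == b                         # keep only genuine matches
    return a[ii[found], :]

a = np.array([[102, 5.5],
              [204, 4.5],
              [343, 9.0],
              [40, 8.01]])   # first column is not sorted

print(rows_by_id(a, np.array([102, 343])))
print(rows_by_id(a, np.array([102, 500])))  # 500 is absent and is skipped
```

When the same a is queried many times, computing `order` once and reusing it avoids repeating the sort on every call.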