Let's say I have a numpy array of the form
x = np.array([[2, 5],
              [3, 4],
              [1, 3],
              [2, 5],
              [4, 5],
              [1, 3],
              [1, 4],
              [3, 4]])
What I would like to get from this is an array which contains only the rows which are NOT duplicates, i.e., I expect from this example
array([[4, 5],
       [1, 4]])
I'm looking for a method which is reasonably fast and scales well. The only way that I can think to do this is to first find the set of unique rows in x as a new array y, then build an array z which has those individual elements of y removed from x (thus z is a list of the duplicated rows in x), and finally take the set difference of x and z. This seems horribly inefficient though. Anyone have a better way?
If it is important, I'm guaranteed that each of my rows will be sorted smallest to largest, so you'll never have a row be [5, 2] or [3, 1].
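For reference, on NumPy 1.13 and newer this can be done directly with np.unique along axis=0 (a minimal sketch; note that the surviving rows come back in lexicographically sorted order rather than their original order):

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# Unique rows plus how often each occurs (rows are returned sorted)
uniq, counts = np.unique(x, axis=0, return_counts=True)

# Keep only the rows that occur exactly once
out = uniq[counts == 1]
print(out)  # [[1 4]
            #  [4 5]]
```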
Approach #1
Here's an approach based on np.unique
and considering each row as an indexing tuple for efficiency (assuming that the input array has integers) -
# Consider each row as indexing tuple & get linear indexing value
lid = np.ravel_multi_index(x.T,x.max(0)+1)
# Get counts and unique indices
_,idx,count = np.unique(lid,return_index=True,return_counts=True)
# See which counts are exactly 1 and select the corresponding unique indices
# and thus the corresponding rows from input as the final output
out = x[idx[count==1]]
Note: If there is a huge number of columns in the input array, you might want to get the linear indices lid
manually, for which you can use np.cumprod
, like so -
lid = x.dot(np.append(1,(x.max(0)+1)[::-1][:-1].cumprod())[::-1])
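As a sanity check (a sketch reusing the sample x from the question), the manual cumprod-based linear index agrees with np.ravel_multi_index:

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# Linear indices via ravel_multi_index (dims are the per-column maxima + 1)
lid1 = np.ravel_multi_index(x.T, x.max(0) + 1)

# Same indices built manually from a cumulative product of the dims
lid2 = x.dot(np.append(1, (x.max(0) + 1)[::-1][:-1].cumprod())[::-1])

print(np.array_equal(lid1, lid2))  # True
```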
Approach #2
Here's an alternative one that offloads the counting task to np.bincount
, which might be more efficient for such purposes -
# Consider each row as indexing tuple & get linear indexing value
lid = np.ravel_multi_index(x.T,x.max(0)+1)
# Get unique indices and tagged indices for all elements
_,unq_idx,tag_idx = np.unique(lid,return_index=True,return_inverse=True)
# Use the tagged indices to count and look for count==1 and repeat like before
out = x[unq_idx[np.bincount(tag_idx)==1]]
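Run end-to-end on the sample x from the question (a self-contained sketch; the surviving rows come out in linear-index order rather than their original order):

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

lid = np.ravel_multi_index(x.T, x.max(0) + 1)
_, unq_idx, tag_idx = np.unique(lid, return_index=True, return_inverse=True)

# bincount over the inverse mapping counts occurrences of each unique row
out = x[unq_idx[np.bincount(tag_idx) == 1]]
print(out)  # [[1 4]
            #  [4 5]]
```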
Approach #3
Here's a different approach using convolution to catch such a pattern; the inline comments explain the underlying idea -
# Consider each row as indexing tuple & get linear indexing value
lid = np.ravel_multi_index(x.T,x.max(0)+1)
# Store sorted indices for lid
sidx = lid.argsort()
# Append 1s at either ends of sorted and differentiated version of lid
mask = np.hstack((True,np.diff(lid[sidx])!=0,True))
# Perform convolution on it. Thus non duplicate elements would have
# consecutive two True elements, which could be caught with convolution
# kernel of [1,1]. Get the corresponding mask.
# Index into sorted indices with it for final output
out = x[sidx[(np.convolve(mask,[1,1])>1)[1:-1]]]
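The same sample x confirms that the convolution trick picks out the same two rows (a self-contained sketch):

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

lid = np.ravel_multi_index(x.T, x.max(0) + 1)
sidx = lid.argsort()
mask = np.hstack((True, np.diff(lid[sidx]) != 0, True))

# A singleton run in the sorted order is flanked by two True boundaries,
# so the [1,1] convolution exceeds 1 only at those positions
out = x[sidx[(np.convolve(mask, [1, 1]) > 1)[1:-1]]]
print(out)  # [[1 4]
            #  [4 5]]
```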
Here is a pandas
approach:
# keep=False drops every row that has a duplicate
# (use .as_matrix() instead of .to_numpy() on very old pandas)
pd.DataFrame(x).drop_duplicates(keep=False).to_numpy()
#array([[4, 5],
#       [1, 4]])
One possibility (requiring a lot of memory for arrays with many rows) is to first count, for each pair of rows, how many positions hold equal values:
b = np.sum(x[:, None, :] == x, axis=2)
b
array([[2, 0, 0, 2, 1, 0, 0, 0],
       [0, 2, 0, 0, 0, 0, 1, 2],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [2, 0, 0, 2, 1, 0, 0, 0],
       [1, 0, 0, 1, 2, 0, 0, 0],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [0, 1, 1, 0, 0, 1, 2, 1],
       [0, 2, 0, 0, 0, 0, 1, 2]])
This array shows which row has how many equal elements with another row. The diagonal is comparing the row with itself so needs to be set to zero:
np.fill_diagonal(b, 0)
b
array([[0, 0, 0, 2, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 2],
       [0, 0, 0, 0, 0, 2, 1, 0],
       [2, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 2, 0, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 0, 0, 0, 1, 0]])
Now let's see what is the maximum for each row:
c = np.max(b, axis=0)
c
array([2, 2, 2, 2, 1, 2, 1, 2])
and then we need to find the rows where this maximum is != 2 and index those from the original array:
x[c != 2]
array([[4, 5],
       [1, 4]])
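Put together as a self-contained sketch (note that the intermediate comparison array grows quadratically with the number of rows):

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# Count, for every pair of rows, how many positions hold equal values
b = np.sum(x[:, None, :] == x, axis=2)
np.fill_diagonal(b, 0)  # a row always matches itself; ignore that

# A row is a duplicate iff it fully matches (all 2 columns of) another row
out = x[np.max(b, axis=0) != 2]
print(out)  # [[4 5]
            #  [1 4]]
```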