
Get non-duplicate rows from numpy array

Let's say I have a numpy array of the form

x = np.array([[2, 5],
              [3, 4],
              [1, 3],
              [2, 5],
              [4, 5],
              [1, 3],
              [1, 4],
              [3, 4]])

What I would like to get from this is an array containing only the rows that are NOT duplicated, i.e., for this example I expect

array([[4, 5],
       [1, 4]])

I'm looking for a method which is reasonably fast and scales well. The only way I can think of to do this is:

  1. First find the set of unique rows in x, as a new array y.
  2. Create a new array z by removing one copy of each row of y from x, so that z contains exactly the rows that are duplicated in x.
  3. Do a set difference between x and z.

This seems horribly inefficient though. Anyone have a better way?

If it is important, I'm guaranteed that each of my rows will be sorted smallest to largest so that you'll never have a row be [5, 2] or [3, 1].
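
For reference, a plain-Python baseline that pins down the desired behaviour (keep only the rows that occur exactly once) might look like the sketch below; collections.Counter is used purely for illustration here, not as a proposed fast solution.

import numpy as np
from collections import Counter

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# Count each row (as a tuple) and keep only the rows seen exactly once.
counts = Counter(map(tuple, x))
non_dupes = np.array([row for row in map(tuple, x) if counts[row] == 1])
# -> array([[4, 5],
#           [1, 4]])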

asked by zephyr, Apr 29 '16



3 Answers

Approach #1

Here's an approach based on np.unique and considering each row as an indexing tuple for efficiency (assuming that the input array has integers) -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get counts and unique indices
_,idx,count = np.unique(lid,return_index=True,return_counts=True)

# See which counts are exactly 1 and select the corresponding unique indices 
# and thus the corresponding rows from input as the final output
out = x[idx[count==1]]

Note: If there is a huge number of columns in the input array, you might want to get the linear indices lid manually, for which you can use np.cumprod, like so -

lid = x.dot(np.append(1,(x.max(0)+1)[::-1][:-1].cumprod())[::-1])
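
As a quick sanity check (a sketch using the integer array x from the question), the manual dot-product indexing produces the same linear IDs as np.ravel_multi_index:

import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

dims = x.max(0) + 1                      # per-column extent of the values
lid_builtin = np.ravel_multi_index(x.T, dims)
lid_manual = x.dot(np.append(1, dims[::-1][:-1].cumprod())[::-1])
assert (lid_builtin == lid_manual).all()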

Approach #2

Here's an alternative one that offloads the counting task to np.bincount, which might be more efficient for such purposes -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get unique indices and tagged indices for all elements
_,unq_idx,tag_idx = np.unique(lid,return_index=True,return_inverse=True)

# Use the tagged indices to count and look for count==1 and repeat like before
out = x[unq_idx[np.bincount(tag_idx)==1]]
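
For what it's worth, np.bincount(tag_idx) reproduces exactly the counts that return_counts gives in Approach #1, so both approaches select the same rows. A quick check, assuming the x from the question:

import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])
lid = np.ravel_multi_index(x.T, x.max(0) + 1)

# unique values, first-occurrence indices, inverse mapping and counts
_, unq_idx, tag_idx, counts = np.unique(lid, return_index=True,
                                        return_inverse=True, return_counts=True)
assert (np.bincount(tag_idx) == counts).all()
print(x[unq_idx[np.bincount(tag_idx) == 1]])   # the two non-duplicated rows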

Approach #3

Here's a different approach that uses convolution to catch such a pattern. The inlined comments explain the underlying idea -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Store sorted indices for lid
sidx = lid.argsort()

# Append True at both ends of the sorted and differentiated version of lid
mask = np.hstack((True,np.diff(lid[sidx])!=0,True))

# Perform convolution on it. Thus non-duplicate elements would have
# two consecutive True elements, which can be caught with a convolution
# kernel of [1,1]. Get the corresponding mask. 
# Index into sorted indices with it for final output
out = x[sidx[(np.convolve(mask,[1,1])>1)[1:-1]]]
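
The convolution is simply detecting two consecutive True values in mask. If that feels opaque, an equivalent way to write the same selection (reusing lid, sidx and mask from the snippet above) is:

# An ID kept here differs from both its left and its right neighbour in the
# sorted order, i.e. it occurs exactly once in lid.
out = x[sidx[mask[:-1] & mask[1:]]]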
answered by Divakar


Here is a pandas approach:

pd.DataFrame(x.T).T.drop_duplicates(keep=False).as_matrix()

#array([[4, 5],
#       [1, 4]])
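
Note that DataFrame.as_matrix() has since been removed from pandas (in 1.0). With a recent pandas the same idea can be written as below; building the DataFrame directly from x is equivalent to the double transpose above:

import numpy as np
import pandas as pd

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# keep=False drops every member of a duplicated group, not just the extra copies.
out = pd.DataFrame(x).drop_duplicates(keep=False).to_numpy()
#array([[4, 5],
#       [1, 4]])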
answered by Colonel Beauvel


One possibility (which requires a lot of memory for arrays with many rows) is to first build a matrix that counts, for every pair of rows, how many elements they have in common:

b = np.sum(x[:, None, :] == x, axis=2)
b
array([[2, 0, 0, 2, 1, 0, 0, 0],
       [0, 2, 0, 0, 0, 0, 1, 2],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [2, 0, 0, 2, 1, 0, 0, 0],
       [1, 0, 0, 1, 2, 0, 0, 0],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [0, 1, 1, 0, 0, 1, 2, 1],
       [0, 2, 0, 0, 0, 0, 1, 2]])

Entry (i, j) of this array counts how many elements row i has in common with row j. The diagonal compares each row with itself, so it needs to be set to zero:

np.fill_diagonal(b, 0)
b
array([[0, 0, 0, 2, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 2],
       [0, 0, 0, 0, 0, 2, 1, 0],
       [2, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 2, 0, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 0, 0, 0, 1, 0]])

Now look at the maximum match count for each row:

c = np.max(b, axis=0)
c
array([2, 2, 2, 2, 1, 2, 1, 2])

and then we need to find the rows where this maximum is != 2 (i.e. less than the number of columns) and index these from the original array:

x[c != 2]
array([[4, 5],
       [1, 4]])
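
A slightly more direct variant of the same broadcasting idea (just a sketch, with the same quadratic memory cost) compares whole rows for equality instead of counting matching elements:

eq = (x[:, None, :] == x[None, :, :]).all(axis=2)  # eq[i, j]: rows i and j are identical
np.fill_diagonal(eq, False)                        # a row trivially matches itself
out = x[~eq.any(axis=1)]                           # keep rows with no other identical row
# -> array([[4, 5],
#           [1, 4]])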
answered by MSeifert