Let's say I have a numpy array of the form
x = np.array([[2, 5],
              [3, 4],
              [1, 3],
              [2, 5],
              [4, 5],
              [1, 3],
              [1, 4],
              [3, 4]])
What I would like to get from this is an array which contains only the rows which are NOT duplicates, i.e., I expect from this example
array([[4, 5],
       [1, 4]])
I'm looking for a method which is reasonably fast and scales well. The only way that I can think to do this is to first find the set of unique rows in x as a new array y, then build an array z which has those individual elements of y removed from x (thus z is a list of the duplicated rows in x), and finally take the set difference of x and z. This seems horribly inefficient though. Anyone have a better way?
If it is important, I'm guaranteed that each of my rows will be sorted smallest to largest, so you'll never have a row be [5, 2] or [3, 1].
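For reference, on NumPy 1.13 and newer this can be done directly with np.unique along axis=0 (a minimal sketch; note that the surviving rows come back in lexicographically sorted order rather than their original order):

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# Unique rows plus how often each occurs (rows are returned sorted)
uniq, counts = np.unique(x, axis=0, return_counts=True)

# Keep only the rows that occur exactly once
out = uniq[counts == 1]
print(out)  # [[1 4]
            #  [4 5]]
```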
Approach #1
Here's an approach based on np.unique
and considering each row as an indexing tuple for efficiency (assuming that the input array has integers) -
# Consider each row as indexing tuple & get linear indexing value
lid = np.ravel_multi_index(x.T,x.max(0)+1)
# Get counts and unique indices
_,idx,count = np.unique(lid,return_index=True,return_counts=True)
# See which counts are exactly 1 and select the corresponding unique indices
# and thus the corresponding rows from input as the final output
out = x[idx[count==1]]
Note: If there is a huge number of columns in the input array, you might want to get the linear indices lid
manually, for which you can use np.cumprod
, like so -
lid = x.dot(np.append(1,(x.max(0)+1)[::-1][:-1].cumprod())[::-1])
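As a sanity check (a sketch reusing the sample x from the question), the manual cumprod-based linear index agrees with np.ravel_multi_index:

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# Linear indices via ravel_multi_index (dims are the per-column maxima + 1)
lid1 = np.ravel_multi_index(x.T, x.max(0) + 1)

# Same indices built manually from a cumulative product of the dims
lid2 = x.dot(np.append(1, (x.max(0) + 1)[::-1][:-1].cumprod())[::-1])

print(np.array_equal(lid1, lid2))  # True
```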
Approach #2
Here's an alternative one that offloads the counting task to np.bincount
, which might be more efficient for such purposes -
# Consider each row as indexing tuple & get linear indexing value
lid = np.ravel_multi_index(x.T,x.max(0)+1)
# Get unique indices and tagged indices for all elements
_,unq_idx,tag_idx = np.unique(lid,return_index=True,return_inverse=True)
# Use the tagged indices to count and look for count==1 and repeat like before
out = x[unq_idx[np.bincount(tag_idx)==1]]
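Run end-to-end on the sample x from the question (a self-contained sketch; the surviving rows come out in linear-index order rather than their original order):

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

lid = np.ravel_multi_index(x.T, x.max(0) + 1)
_, unq_idx, tag_idx = np.unique(lid, return_index=True, return_inverse=True)

# bincount over the inverse mapping counts occurrences of each unique row
out = x[unq_idx[np.bincount(tag_idx) == 1]]
print(out)  # [[1 4]
            #  [4 5]]
```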
Approach #3
Here's a different approach using convolution to catch such a pattern; the inline comments explain the underlying idea -
# Consider each row as indexing tuple & get linear indexing value
lid = np.ravel_multi_index(x.T,x.max(0)+1)
# Store sorted indices for lid
sidx = lid.argsort()
# Append 1s at either ends of sorted and differentiated version of lid
mask = np.hstack((True,np.diff(lid[sidx])!=0,True))
# Perform convolution on it. Thus non duplicate elements would have
# consecutive two True elements, which could be caught with convolution
# kernel of [1,1]. Get the corresponding mask.
# Index into sorted indices with it for final output
out = x[sidx[(np.convolve(mask,[1,1])>1)[1:-1]]]
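The same sample x confirms that the convolution trick picks out the same two rows (a self-contained sketch):

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

lid = np.ravel_multi_index(x.T, x.max(0) + 1)
sidx = lid.argsort()
mask = np.hstack((True, np.diff(lid[sidx]) != 0, True))

# A singleton run in the sorted order is flanked by two True boundaries,
# so the [1,1] convolution exceeds 1 only at those positions
out = x[sidx[(np.convolve(mask, [1, 1]) > 1)[1:-1]]]
print(out)  # [[1 4]
            #  [4 5]]
```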
Here is a pandas
approach:
# keep=False drops every row that has a duplicate
# (use .as_matrix() instead of .to_numpy() on very old pandas)
pd.DataFrame(x).drop_duplicates(keep=False).to_numpy()
#array([[4, 5],
#       [1, 4]])
One possibility (requiring a lot of memory for arrays with many rows) is to first count, for each pair of rows, how many positions hold equal values:
b = np.sum(x[:, None, :] == x, axis=2)
b
array([[2, 0, 0, 2, 1, 0, 0, 0],
       [0, 2, 0, 0, 0, 0, 1, 2],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [2, 0, 0, 2, 1, 0, 0, 0],
       [1, 0, 0, 1, 2, 0, 0, 0],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [0, 1, 1, 0, 0, 1, 2, 1],
       [0, 2, 0, 0, 0, 0, 1, 2]])
This array shows which row has how many equal elements with another row. The diagonal is comparing the row with itself so needs to be set to zero:
np.fill_diagonal(b, 0)
b
array([[0, 0, 0, 2, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 2],
       [0, 0, 0, 0, 0, 2, 1, 0],
       [2, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 2, 0, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 0, 0, 0, 1, 0]])
Now let's see what is the maximum for each row:
c = np.max(b, axis=0)
c
array([2, 2, 2, 2, 1, 2, 1, 2])
and then we need to find the rows where this maximum is != 2 and index those from the original array:
x[c != 2]
array([[4, 5],
       [1, 4]])
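Put together as a self-contained sketch (note that the intermediate comparison array grows quadratically with the number of rows):

```python
import numpy as np

x = np.array([[2, 5], [3, 4], [1, 3], [2, 5],
              [4, 5], [1, 3], [1, 4], [3, 4]])

# Count, for every pair of rows, how many positions hold equal values
b = np.sum(x[:, None, :] == x, axis=2)
np.fill_diagonal(b, 0)  # a row always matches itself; ignore that

# A row is a duplicate iff it fully matches (all 2 columns of) another row
out = x[np.max(b, axis=0) != 2]
print(out)  # [[4 5]
            #  [1 4]]
```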