
Remove sub-array within a numpy array if any item in it has already appeared in a previous array

I have a two-dimensional numpy array. I need to filter out duplicate rows: a row counts as a duplicate if any item in it already appears in a previous row.


# i.e.:
import numpy as np

arr = np.array([[4580, 4581, 4657, 4658],
                [4580, 4581, 4657, 4659],   # -> duplicate because of 4580
                [4650, 4652, 4654, 4655],
                [4651, 4655, 4652, 4656]])  # -> duplicate because of 4652

# Output should be:
# array([[4580, 4581, 4657, 4658],
#        [4650, 4652, 4654, 4655]])

The script below gives me the expected output for small inputs, but it chokes on large arrays. I'm sure there's a much simpler and more efficient way to do this, but I can't seem to find it.

check = np.array([not(np.in1d(a, np.unique(arr[:i])).any()) for i,a in enumerate(arr)])
arr[check]
asked Mar 08 '26 18:03 by Kristof


1 Answer

You can find unique elements of the array using np.unique. Passing the argument return_index=True returns the indices of the first occurrence of the unique elements. Note that since unique implicitly flattens the array, these values are indices in the flattened array.

unique_elems, unique_indices = np.unique(arr, return_index=True)
# array([4580, 4581, 4650, 4651, 4652, 4654, 4655, 4656, 4657, 4658, 4659]),
# array([ 0,  1,  8, 12,  9, 10, 11, 15,  2,  3,  7], dtype=int64)
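To make the flattening concrete, here is a small sketch (an addition for illustration, not a required step) that converts the flat indices back into (row, column) positions with np.unravel_index. The names rows, cols and pos_4659 are illustrative only.

```python
import numpy as np

arr = np.array([[4580, 4581, 4657, 4658],
                [4580, 4581, 4657, 4659],
                [4650, 4652, 4654, 4655],
                [4651, 4655, 4652, 4656]])

unique_elems, unique_indices = np.unique(arr, return_index=True)

# Convert each flat index into a (row, column) position in arr.
rows, cols = np.unravel_index(unique_indices, arr.shape)

# Indexing arr at those positions recovers the unique values themselves.
assert np.array_equal(arr[rows, cols], unique_elems)

# Look up where 4659 first appears: flat index 7 in a 4-column array.
flat_4659 = unique_indices[unique_elems == 4659][0]
pos_4659 = tuple(int(i) for i in np.unravel_index(flat_4659, arr.shape))
```

Here pos_4659 comes out as row 1, column 3, which matches flat index 7 in an array with four columns.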

Now, we want to select the rows whose elements all sit at positions listed in unique_indices, i.e. rows in which every element is a first occurrence. First, let's create an array with the same shape as arr that holds each element's index in the flattened array:

mapping = np.arange(arr.size).reshape(arr.shape)

Now, let's see which indices are in unique_indices:

select_elem = np.isin(mapping, unique_indices)
# array([[ True,  True,  True,  True],
#        [False, False, False,  True],
#        [ True,  True,  True,  True],
#        [ True, False, False,  True]])

And finally, select only the rows of select_elem which are all True:

select_rows = select_elem.all(axis=1)
# array([ True, False,  True, False])

Using this to index into the array, we get the desired result:

result = arr[select_rows]
# array([[4580, 4581, 4657, 4658],
#        [4650, 4652, 4654, 4655]])
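For convenience, the steps can be bundled into one function. This is only a sketch: the name drop_rows_with_seen_items is made up for illustration, a 2-D input is assumed, and it marks first occurrences by writing into a flat boolean mask rather than calling np.isin on a mapping array, which should give the same selection.

```python
import numpy as np

def drop_rows_with_seen_items(arr):
    """Keep only rows in which every item is a first occurrence.

    Hypothetical helper name; assumes a 2-D input.
    """
    arr = np.asarray(arr)
    # Flat index of the first occurrence of each distinct value.
    _, first_idx = np.unique(arr, return_index=True)
    # Mark first occurrences in a flat boolean mask, then view it row by row.
    is_first = np.zeros(arr.size, dtype=bool)
    is_first[first_idx] = True
    keep = is_first.reshape(arr.shape).all(axis=1)
    return arr[keep]

arr = np.array([[4580, 4581, 4657, 4658],
                [4580, 4581, 4657, 4659],
                [4650, 4652, 4654, 4655],
                [4651, 4655, 4652, 4656]])
result = drop_rows_with_seen_items(arr)
```

One behaviour to be aware of: a row that repeats a value within itself (for example [1, 1, 2, 3]) is dropped as well, because only one of the repeated items can be a first occurrence. The loop in the question compares each row against earlier rows only, so it would keep such a row.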

Here's how the performance varies with input size:

[Image: plot of run time versus input size]

Timeless's method performs (unsurprisingly) nearly identically to yours, since it has the same bottleneck: a Python-level iteration over the rows of the array. The pure-numpy approach I showed above runs significantly faster. I did not time Yossi's method because it gives the wrong result.
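If you want to reproduce the comparison yourself, a rough timing sketch could look like the following. The function names, the input size and the random test data are all assumptions made for illustration; they are not the setup behind the plot. np.isin stands in for np.in1d in the loop version.

```python
import timeit
import numpy as np

def loop_filter(arr):
    # Row-by-row check, following the script in the question
    # (np.isin stands in for np.in1d).
    check = np.array([not np.isin(a, np.unique(arr[:i])).any()
                      for i, a in enumerate(arr)])
    return arr[check]

def vectorized_filter(arr):
    # The np.unique / np.isin steps from this answer.
    _, unique_indices = np.unique(arr, return_index=True)
    mapping = np.arange(arr.size).reshape(arr.shape)
    return arr[np.isin(mapping, unique_indices).all(axis=1)]

rng = np.random.default_rng(0)
n_rows = 500
# Column j only holds values congruent to j modulo 4, so a row never
# repeats a value internally and both methods should agree.
data = rng.integers(0, n_rows, size=(n_rows, 4)) * 4 + np.arange(4)

assert np.array_equal(loop_filter(data), vectorized_filter(data))

t_loop = timeit.timeit(lambda: loop_filter(data), number=3)
t_vec = timeit.timeit(lambda: vectorized_filter(data), number=3)
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s")
```

Absolute timings depend on the machine and the NumPy version, so treat the printed numbers as relative only.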

answered Mar 11 '26 07:03 by Pranav Hosangadi


