I have a two-dimensional NumPy array. I need to filter out duplicates: if any item in a row already appears in a previous row, that row is considered a duplicate.
For example:
arr = np.array([[4580, 4581, 4657, 4658],
                [4580, 4581, 4657, 4659],   # duplicate because of 4580
                [4650, 4652, 4654, 4655],
                [4651, 4655, 4652, 4656]])  # duplicate because of 4652
The output should be:
array([[4580, 4581, 4657, 4658],
       [4650, 4652, 4654, 4655]])
The script below gives the expected output for small inputs, but it chokes on large arrays. I'm sure there's a much simpler and more efficient way to do this, but I can't seem to find it.
check = np.array([not(np.in1d(a, np.unique(arr[:i])).any()) for i,a in enumerate(arr)])
arr[check]
You can find unique elements of the array using np.unique. Passing the argument return_index=True returns the indices of the first occurrence of the unique elements. Note that since unique implicitly flattens the array, these values are indices in the flattened array.
unique_elems, unique_indices = np.unique(arr, return_index=True)
# array([4580, 4581, 4650, 4651, 4652, 4654, 4655, 4656, 4657, 4658, 4659]),
# array([ 0, 1, 8, 12, 9, 10, 11, 15, 2, 3, 7], dtype=int64)
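To make the "indices in the flattened array" point concrete, here is a small sketch that converts those flat indices back into (row, column) positions with np.unravel_index. The names rows and cols are only illustrative; they are not used in the rest of the answer.

```python
import numpy as np

arr = np.array([[4580, 4581, 4657, 4658],
                [4580, 4581, 4657, 4659],
                [4650, 4652, 4654, 4655],
                [4651, 4655, 4652, 4656]])

unique_elems, unique_indices = np.unique(arr, return_index=True)

# Convert each flat index back into a (row, column) pair.
rows, cols = np.unravel_index(unique_indices, arr.shape)

# Indexing arr with those pairs recovers the unique values themselves,
# which confirms that the indices point at first occurrences in arr.
recovered = arr[rows, cols]
print(np.array_equal(recovered, unique_elems))
```

For instance, 4659 has flat index 7, which is row 1, column 3 of the 4-column array.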
Now we want to select the rows whose elements all sit at positions listed in unique_indices, i.e. rows in which every element is the first occurrence of its value. First, let's create an array that gives, for each position in arr, the index of that element in the flattened array:
mapping = np.arange(arr.size).reshape(arr.shape)
Now, let's see which indices are in unique_indices:
select_elem = np.isin(mapping, unique_indices)
# array([[ True, True, True, True],
# [False, False, False, True],
# [ True, True, True, True],
# [ True, False, False, True]])
And finally, select only the rows of select_elem which are all True:
select_rows = select_elem.all(axis=1)
# array([ True, False, True, False])
Using this to index into the array, we get the desired result:
result = arr[select_rows]
# array([[4580, 4581, 4657, 4658],
# [4650, 4652, 4654, 4655]])
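For convenience, the steps above can be collected into one function. This is only a sketch: the function name drop_rows_with_seen_items is made up for illustration. One assumption worth stating: a row that repeats a value within itself (for example [1, 1, 2, 3]) is also dropped by this approach, because the second 1 is not a first occurrence, whereas the loop in the question would keep such a row.

```python
import numpy as np

def drop_rows_with_seen_items(arr):
    """Keep rows in which every element is the first occurrence of its value.

    Hypothetical helper name; it wraps the steps from the answer.
    """
    arr = np.asarray(arr)
    # Flat indices of the first occurrence of each distinct value.
    _, first_idx = np.unique(arr, return_index=True)
    # Flat index of every cell, laid out in the shape of arr.
    mapping = np.arange(arr.size).reshape(arr.shape)
    # A row survives only if all of its cells are first occurrences.
    keep = np.isin(mapping, first_idx).all(axis=1)
    return arr[keep]

arr = np.array([[4580, 4581, 4657, 4658],
                [4580, 4581, 4657, 4659],
                [4650, 4652, 4654, 4655],
                [4651, 4655, 4652, 4656]])

result = drop_rows_with_seen_items(arr)
print(result)
```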
Here's how the performance varies with input size:

Timeless's method performs (unsurprisingly) nearly identically to yours, since it has the same bottleneck: a Python-level loop over the rows of the array. The pure-NumPy approach I showed above runs significantly faster. I did not time Yossi's method because it gives the wrong result.
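If you want to check this on your own machine, a rough timing harness along the following lines should do. It is not the benchmark behind the comparison above, only a sketch: the function names, the array size and the random test data are my own assumptions, and np.isin stands in for np.in1d in the loop version. The test data is built so that no row repeats a value within itself, which is the case where both approaches return the same rows.

```python
import timeit
import numpy as np

def loop_filter(arr):
    # Row-by-row check from the question, with np.isin in place of np.in1d.
    check = np.array([not np.isin(a, np.unique(arr[:i])).any()
                      for i, a in enumerate(arr)])
    return arr[check]

def vectorized_filter(arr):
    # Pure-NumPy approach: keep rows made up entirely of first occurrences.
    _, first_idx = np.unique(arr, return_index=True)
    mapping = np.arange(arr.size).reshape(arr.shape)
    return arr[np.isin(mapping, first_idx).all(axis=1)]

rng = np.random.default_rng(0)
n_rows = 500  # assumed size; increase it to see the gap grow
# Column j only holds values congruent to j modulo 4, so a row never
# repeats a value within itself, while rows can still share values.
arr = 4 * rng.integers(0, 2000, size=(n_rows, 4)) + np.arange(4)

same = np.array_equal(loop_filter(arr), vectorized_filter(arr))
t_loop = timeit.timeit(lambda: loop_filter(arr), number=3)
t_vec = timeit.timeit(lambda: vectorized_filter(arr), number=3)
print(f"same result: {same}, loop: {t_loop:.4f}s, vectorized: {t_vec:.4f}s")
```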