I have multiple numpy 2-d arrays which I want to compare row-wise. The output of my function should be a single 2-d boolean array covering all rows of the three input arrays. I want to detect the first time a row occurs; any later duplicate of that row in another array should be flagged as False in the output. It is not possible to have duplicate rows within a single array.
If possible I would like to avoid loops, as they slow down the calculation.
Example:
array1 = array([[444, 427],
                [444, 428],
                [444, 429],
                [444, 430],
                [445, 421]], dtype=uint64)
array2 = array([[446, 427],
                [446, 440],
                [444, 429],
                [444, 432],
                [445, 421]], dtype=uint64)
array3 = array([[447, 427],
                [446, 441],
                [444, 429],
                [444, 432],
                [445, 421]], dtype=uint64)
# output
array([[ True,  True,  True,  True,  True],
       [ True,  True, False,  True, False],
       [ True,  True, False, False, False]], dtype=bool)
Any ideas?
Here's a fast vectorized approach:
import numpy as np

def find_dupe_rows(*arrays):
    # Stack all inputs into one tall (ntotal, ncols) array
    A = np.vstack(arrays)
    # View each row as a single void scalar so np.unique compares whole rows
    rtype = np.dtype((np.void, A.dtype.itemsize*A.shape[1]))
    _, first_idx = np.unique(A.view(rtype), return_index=True)
    out = np.zeros(A.shape[0], dtype=bool)  # np.bool is deprecated; use bool
    out[first_idx] = True
    return out.reshape(len(arrays), -1)
Example usage:
print(find_dupe_rows(array1, array2, array3))
# [[ True True True True True]
# [ True True False True False]
# [ True True False False False]]
To break that down a bit:
Stack the three subarrays to produce a (15, 2) array:
A = np.vstack((array1, array2, array3))
Use np.unique together with a void-dtype view of the rows (a trick that lets each whole row be compared as a single scalar) to efficiently find the indices where each unique row first occurs within A:
rtype = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))
_, first_idx = np.unique(A.view(rtype), return_index=True)
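To see what the void view does in isolation, here's a tiny self-contained illustration (the array `B` here is made up for the demo):

```python
import numpy as np

# Each row of B is reinterpreted as one opaque void scalar, so np.unique
# compares whole rows at once instead of individual elements.
B = np.array([[1, 2], [3, 4], [1, 2]], dtype=np.uint64)
vtype = np.dtype((np.void, B.dtype.itemsize * B.shape[1]))
_, first_idx = np.unique(B.view(vtype), return_index=True)
# first_idx holds the index of the first occurrence of each unique row;
# rows 0 and 1 are the unique ones here (row 2 repeats row 0).
```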
Every row that isn't the first occurrence of a unique row can be treated as a duplicate:
out = np.zeros(A.shape[0], dtype=bool)  # output is False by default
out[first_idx] = True # set first occurrences to True
Finally, reshape this boolean vector to (narrays, nrows), as per your example output:
return out.reshape(len(arrays), -1)
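As an aside (not part of the original answer): since NumPy 1.13, np.unique accepts an axis argument, which removes the need for the void-view trick entirely. A sketch of the same function using it:

```python
import numpy as np

def find_dupe_rows_axis(*arrays):
    # Same logic as above, but np.unique(..., axis=0) compares rows
    # directly (requires NumPy >= 1.13).
    A = np.vstack(arrays)
    _, first_idx = np.unique(A, axis=0, return_index=True)
    out = np.zeros(A.shape[0], dtype=bool)
    out[first_idx] = True
    return out.reshape(len(arrays), -1)

array1 = np.array([[444, 427], [444, 428], [444, 429],
                   [444, 430], [445, 421]], dtype=np.uint64)
array2 = np.array([[446, 427], [446, 440], [444, 429],
                   [444, 432], [445, 421]], dtype=np.uint64)
array3 = np.array([[447, 427], [446, 441], [444, 429],
                   [444, 432], [445, 421]], dtype=np.uint64)
result = find_dupe_rows_axis(array1, array2, array3)
```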
If you are looking for duplicates at the same row positions (row IDs) across the 2D input arrays, here's a vectorized approach -
# Concatenate all input 2D arrays to form a single tall 2D array
A = np.vstack((array1,array2,array3))
# Treat each row of the 2D input arrays as a linear-indexing tuple.
# This collapses every row to a single scalar, shrinking the data by
# the row length, which simplifies the rest and helps performance.
A_lidx = np.ravel_multi_index(A.T,A.max(0)+1).reshape(-1,array1.shape[0])
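For intuition, ravel_multi_index treats each row as a multi-dimensional index and maps it to a single scalar, so whole-row comparisons become scalar comparisons. A small illustration with made-up data:

```python
import numpy as np

# Each (r, c) row becomes the scalar r * ncols + c for a grid of shape
# (nrows, ncols), so equal rows map to equal scalars.
rows = np.array([[0, 1], [2, 3], [0, 1]])
dims = rows.max(0) + 1                      # smallest grid holding every row: (3, 4)
lin = np.ravel_multi_index(rows.T, dims)    # -> [1, 11, 1]
```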
# Finally use broadcasting to perform elementwise comparison between
# the elements of each column against themselves. Then, use argmax
# along the first axis giving us the first indices of the duplicates,
# which when compared with index ID would lead us to final boolean array.
out = (A_lidx[:,None] == A_lidx).argmax(0) == np.arange(A_lidx.shape[0])[:,None]
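To make this concrete, here is the per-row-ID approach run end-to-end on the example arrays from the question (a self-contained sketch):

```python
import numpy as np

array1 = np.array([[444, 427], [444, 428], [444, 429],
                   [444, 430], [445, 421]], dtype=np.uint64)
array2 = np.array([[446, 427], [446, 440], [444, 429],
                   [444, 432], [445, 421]], dtype=np.uint64)
array3 = np.array([[447, 427], [446, 441], [444, 429],
                   [444, 432], [445, 421]], dtype=np.uint64)

# Stack, then collapse rows to linear indices: one row of A_lidx per array
A = np.vstack((array1, array2, array3))
A_lidx = np.ravel_multi_index(A.T, A.max(0) + 1).reshape(-1, array1.shape[0])

# For each column, argmax(0) gives the first array holding that value;
# a row is True only when it is its own first occurrence.
out = (A_lidx[:, None] == A_lidx).argmax(0) == np.arange(A_lidx.shape[0])[:, None]
```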
If you were looking for a global search across all rows of all 2D input arrays, you need a slightly modified approach, like so -
# Concatenate all input 2D arrays to form a single tall 2D array
A = np.vstack((array1,array2,array3))
# Treat each row of the 2D input arrays as a linear-indexing tuple.
# This collapses every row to a single scalar, shrinking the data by
# the row length, which simplifies the rest and helps performance.
A_lidx = np.ravel_multi_index(A.T,A.max(0)+1).ravel()
# Find first occurrences by differencing the sorted version and looking
# for indices with positive change. The sort must be stable
# (kind='mergesort') so that, among equal values, the earliest original
# index comes first.
sidx = A_lidx.argsort(kind='mergesort')
first_occ = sidx[np.append(0,np.where(np.diff(A_lidx[sidx])>0)[0]+1)]
# Finally, set those indices as True in an output array of appropriate length
out = np.in1d(np.arange(len(A_lidx)),first_occ).reshape(3,-1)
Please note that the step to calculate first_occ is basically a crude way of using np.unique(..., return_index=True), as used in @ali_m's solution.
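And the global variant end-to-end on the question's arrays (a self-contained sketch; a stable sort kind and np.isin, the modern spelling of np.in1d, are used here):

```python
import numpy as np

array1 = np.array([[444, 427], [444, 428], [444, 429],
                   [444, 430], [445, 421]], dtype=np.uint64)
array2 = np.array([[446, 427], [446, 440], [444, 429],
                   [444, 432], [445, 421]], dtype=np.uint64)
array3 = np.array([[447, 427], [446, 441], [444, 429],
                   [444, 432], [445, 421]], dtype=np.uint64)

A = np.vstack((array1, array2, array3))
A_lidx = np.ravel_multi_index(A.T, A.max(0) + 1).ravel()

# Stable sort keeps equal linear indices in original order, so the first
# element of each run of equal values is the true first occurrence.
sidx = A_lidx.argsort(kind='mergesort')
first_occ = sidx[np.append(0, np.where(np.diff(A_lidx[sidx]) > 0)[0] + 1)]
out = np.isin(np.arange(len(A_lidx)), first_occ).reshape(3, -1)
```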