Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Detect the first unique rows in multiple numpy 2-d arrays

I have multiple numpy 2-d arrays which I want to compare rowwise. The output of my function should be a numpy 2-d array representing all rows of the three inputs arrays. I want to be able to detect the first time that a row occurs, every second or third duplicate row should be flagged as False in the output. It is not possible to have duplicate rows within a single array.

If it is possible I would like to avoid the use of loops, as they slow down the calculation speed.

Example:

array1 = array([[444, 427],
   [444, 428],
   [444, 429],
   [444, 430],
   [445, 421]], dtype=uint64)

array2 = array([[446, 427],
   [446, 440],
   [444, 429],
   [444, 432],
   [445, 421]], dtype=uint64)

array3 = array([[447, 427],
   [446, 441],
   [444, 429],
   [444, 432],
   [445, 421]], dtype=uint64)

# output
array([[True, True, True, True,  True],
   [ True,  True,  False, True,  False],
   [ True,  True,  False, False,  False]], dtype=bool)

Any ideas?

like image 703
Wilmar van Ommeren Avatar asked Dec 08 '22 22:12

Wilmar van Ommeren


2 Answers

Here's a fast vectorized approach:

def find_dupe_rows(*arrays):

    A = np.vstack(arrays)
    rtype = np.dtype((np.void, A.dtype.itemsize*A.shape[1]))
    _, first_idx = np.unique(A.view(rtype), return_index=True)
    out = np.zeros(A.shape[0], np.bool)
    out[first_idx] = True

    return out.reshape(len(arrays), -1)

Example usage:

print(find_dupe_rows(array1, array2, array3))
# [[ True  True  True  True  True]
#  [ True  True False  True False]
#  [ True  True False False False]]

To break that down a bit:

  1. Stack the three subarrays to produce a (15, 2) array:

    A = np.vstack((array1, array2, array3))
    
  2. Use np.unique together with this trick to efficiently find the indices where each unique row first occurs within A:

    rtype = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))
    _, first_idx = np.unique(A.view(rtype), return_index=True)
    
  3. Every row that isn't the first occurrence of a unique row can be treated as a duplicate:

    out = np.zeros(A.shape[0], np.bool)     # output is False by default
    out[first_idx] = True                   # set first occurrences to True
    
  4. Finally, reshape this boolean vector to (narrays, nrows), as per your example output:

    return out.reshape(len(arrays), -1)
    
like image 95
ali_m Avatar answered Dec 11 '22 08:12

ali_m


If you are looking for duplicates along the same row IDs across the 2D input arrays, here's a vectorized approach -

# Concatenate all input 2D arrays to form a single tall 2D array
A = np.vstack((array1,array2,array3))

# Consider the rows of the 2D input arrays as linear indexing tuples. 
# Thus, we can reduce the input size, reduced by length of rows.
# This would help in simplifying the solution ahead and help in performance.
A_lidx = np.ravel_multi_index(A.T,A.max(0)+1).reshape(-1,array1.shape[0])

# Finally use broadcasting to perform elementwise comparison between 
# the elements of each column against themselves. Then, use argmax 
# along the first axis giving us the first indices of the duplicates, 
# which when compared with index ID would lead us to final boolean array.
out = (A_lidx[:,None] == A_lidx).argmax(0) == np.arange(A_lidx.shape[0])[:,None]

If you were looking for a global search across all rows of all 2D input arrays, you need a bit modified approach, like so -

# Concatenate all input 2D arrays to form a single tall 2D array
A = np.vstack((array1,array2,array3))

# Consider the rows of the 2D input arrays as linear indexing tuples. 
# Thus, we can reduce the input size, reduced by length of rows.
# This would help in simplifying the solution ahead and help in performance.
A_lidx = np.ravel_multi_index(A.T,A.max(0)+1).ravel()

# Find first occurances by differentiating sorted version and 
# looking for indices with positive change.
sidx = A_lidx.argsort()
first_occ = sidx[np.append(0,np.where(np.diff(A_lidx[sidx])>0)[0]+1)]

# Finally, set those indices as True in an output array of appropriate length
out = np.in1d(np.arange(len(A_lidx)),first_occ).reshape(3,-1)

Please note that the step to calculate first_occ is basically a crude way of using np.unique(..., return_index=True) as used in `@ali_m's solution.

like image 24
Divakar Avatar answered Dec 11 '22 08:12

Divakar