I need to find the row indices of all rows in a numpy array that differ only by sign. For example if I have the array:
>>> A
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 0, -1, -2],
       [ 9,  5,  6],
       [-3, -4, -5]])
I would want the output to be [(0,2),(1,4)]
I know how to find unique rows with numpy.unique, so my intuition was to append the array to its own negation, i.e. numpy.concatenate((A, -1*A)), and then find the non-unique rows, but I am getting confused about how to extract the information I need from that. Also, the array might be pretty large, so appending it to itself might not be a great idea.
I am getting the right answer by just looping over the array and checking whether each row is equal to the negation of another row, but that is taking a long time. I would like something as fast as numpy.unique.
I have already removed all duplicate rows from A if that makes any difference in the process.
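For reference, the concatenate idea from the question can be made to work by labeling rows with np.unique's return_inverse — a rough sketch (assuming NumPy >= 1.13 for the axis argument), though it still doubles the memory, which matches the concern about large arrays:

```python
import numpy as np

A = np.array([[ 0,  1,  2],
              [ 3,  4,  5],
              [ 0, -1, -2],
              [ 9,  5,  6],
              [-3, -4, -5]])

# Stack A on top of -A and give every distinct row an integer label;
# A[i] == -A[j] exactly when row i of A and row j of -A share a label.
B = np.concatenate((A, -A))
_, inv = np.unique(B, axis=0, return_inverse=True)

n = len(A)
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
         if inv[i] == inv[n + j]]
print(pairs)  # [(0, 2), (1, 4)]
```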
Here's a mostly NumPy based one -
def group_dup_rowids(a):
    sidx = np.lexsort(a.T)
    b = a[sidx]
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = sidx.tolist()
    return [C[i:j] for i,j in zip(idx[::2], idx[1::2]+1)]
out = group_dup_rowids(np.abs(a))
Sample run -
In [175]: a
Out[175]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 0, -1, -2],
       [ 9,  5,  6],
       [-3, -4, -5]])
In [176]: group_dup_rowids(np.abs(a))
Out[176]: [[0, 2], [1, 4]]
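To see why this works: np.lexsort(a.T) returns an order in which identical rows become adjacent, so duplicate groups turn into runs in the sorted array and the mask marks the run boundaries. A quick check of the intermediates on the same sample data:

```python
import numpy as np

a = np.array([[ 0,  1,  2],
              [ 3,  4,  5],
              [ 0, -1, -2],
              [ 9,  5,  6],
              [-3, -4, -5]])

b = np.abs(a)
sidx = np.lexsort(b.T)   # lexicographic sort order of the rows
print(sidx.tolist())     # [0, 2, 1, 4, 3] -> rows 0/2 and 1/4 end up adjacent

# True where a sorted row equals the row right before it, i.e. inside a run
runs = (b[sidx][1:] == b[sidx][:-1]).all(1)
print(runs.tolist())     # [True, False, True, False]
```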
For the case where you are looking for exact negation-paired matches (each row paired with its elementwise negation), we just need a minor modification -
def group_dup_rowids_negation(ar):
    a = np.abs(ar)
    sidx = np.lexsort(a.T)
    b = ar[sidx]
    m = np.concatenate(([False], (b[1:] == -b[:-1]).all(1), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = sidx.tolist()
    return [C[i:j] for i,j in zip(idx[::2], idx[1::2]+1)]
Sample run -
In [354]: a
Out[354]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 0, -1, -2],
       [ 9,  5,  6],
       [-3, -4, -5]])
In [355]: group_dup_rowids_negation(a)
Out[355]: [[0, 2], [1, 4]]
In [356]: a[-1] = [-3,4,-5]
In [357]: group_dup_rowids_negation(a)
Out[357]: [[0, 2]]
Runtime test
Other working approaches -
# @Joe Iddon's soln
def for_for_if_listcompr(a):
    return [(i, j) for i in range(len(a)) for j in range(i+1, len(a))
            if all(a[i] == -a[j])]
# @dkato's soln
def find_pairs(A):
    res = []
    for r1 in range(len(A)):
        for r2 in range(r1+1, len(A)):
            if all(A[r1] == -A[r2]):
                res.append((r1, r2))
    return res
Timings -
In [492]: # Setup bigger input case
...: import pandas as pd
...: np.random.seed(0)
...: N = 2000 # datasize decider
...: a0 = np.random.randint(0,9,(N,10))
...: a = a0[np.random.choice(len(a0),4*N)]
...: a[np.random.choice(len(a),2*N, replace=0)] *= -1
...: a = pd.DataFrame(a).drop_duplicates().values
In [493]: %timeit for_for_if_listcompr(a)
...: %timeit find_pairs(a)
1 loop, best of 3: 6.1 s per loop
1 loop, best of 3: 6.05 s per loop
In [494]: %timeit group_dup_rowids_negation(a)
100 loops, best of 3: 2.05 ms per loop
Further improvements
def group_dup_rowids_negation_mod1(ar):
    a = np.abs(ar)
    sidx = np.lexsort(a.T)
    b = ar[sidx]
    dp = view1D(b)
    dn = view1D(-b)
    m = np.concatenate(([False], dp[1:] == dn[:-1], [False]))
    return zip(sidx[m[1:]], sidx[m[:-1]])
def group_dup_rowids_negation_mod2(ar):
    a = np.abs(ar)
    sidx = lexsort_cols_posnum(a)
    b = ar[sidx]
    dp = view1D(b)
    dn = view1D(-b)
    m = np.concatenate(([False], dp[1:] == dn[:-1], [False]))
    return zip(sidx[m[1:]], sidx[m[:-1]])
Helper functions:
# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a):  # a is a 2D array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

# Encodes each row as a single scalar by treating it as an indexing
# tuple (mixed radix) and returns the argsort indices of those scalars
def lexsort_cols_posnum(ar):
    shp = ar.max(0)+1
    s = np.concatenate((np.asarray(shp[1:])[::-1].cumprod()[::-1], [1]))
    return ar.dot(s).argsort()
Runtime test (borrowed from @Paul Panzer's benchmarking) -
In [628]: N = 50000 # datasize decider
...: a0 = np.random.randint(0,99,(N,3))
...: a = a0[np.random.choice(len(a0),4*N)]
...: a[np.random.choice(len(a),2*N, replace=0)] *= -1
...: # OP says no dups
...: a = np.unique(a, axis=0)
...: np.random.shuffle(a)
In [629]: %timeit use_unique(a) # @Paul Panzer's soln
10 loops, best of 3: 33.9 ms per loop
In [630]: %timeit group_dup_rowids_negation(a)
10 loops, best of 3: 54.1 ms per loop
In [631]: %timeit group_dup_rowids_negation_mod1(a)
10 loops, best of 3: 37.4 ms per loop
In [632]: %timeit group_dup_rowids_negation_mod2(a)
100 loops, best of 3: 17.3 ms per loop