I have two numpy arrays of integers, both of length several hundred million. Within each array values are unique, and each is initially unsorted.
I would like the indices to each that yield their sorted intersection. For example:
x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
Then the sorted intersection of these is [4, 5, 11]
, and so we want the indices that turn each of x and y into that array, so we want it to return:
mx = np.array([0, 3, 6])
my = np.array([2, 1, 4])
since then x[mx] == y[my] == np.intersect1d(x, y)
The only solution we have so far involves three different argsorts, so it seems that is unlikely to be optimal.
Each value represents a galaxy, in case that makes the problem more fun.
Step 1: Import numpy. Step 2: Define two numpy arrays. Step 3: Find intersection between the arrays using the numpy. intersect1d() function.
import numpy as np A = np. array([[1, 1], [2, 2]]) B = np. array([[1, 1], [2, 2]]) print(A == B) In this resulting matrix, each element is a result of a comparison of two corresponding elements in the two arrays.
You can access an array element by referring to its index number. The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the second has index 1 etc.
Here's an option based on intersect1d
's implementation, which is fairly straightforward. It requires one call to argsort
.
The admittedly simplistic test passes.
import numpy as np
def my_intersect(x, y):
"""my_intersect(x, y) -> xm, ym
x, y: 1-d arrays of unique values
xm, ym: indices into x and y giving sorted intersection
"""
# basic idea taken from numpy.lib.arraysetops.intersect1d
aux = np.concatenate((x, y))
sidx = aux.argsort()
# Note: intersect1d uses aux[:-1][aux[1:]==aux[:-1]] here - I don't know why the first [:-1] is necessary
inidx = aux[sidx[1:]] == aux[sidx[:-1]]
# quicksort is not stable, so must do some work to extract indices
# (if stable, sidx[inidx.nonzero()] would be for x)
# interlace the two sets of indices, and check against lengths
xym = np.vstack((sidx[inidx.nonzero()],
sidx[1:][inidx.nonzero()])).T.flatten()
xm = xym[xym < len(x)]
ym = xym[xym >= len(x)] - len(x)
return xm, ym
def check_my_intersect(x, y):
mx, my = my_intersect(x, y)
assert (x[mx] == np.intersect1d(x, y)).all()
# not really necessary: np.intersect1d returns a sorted list
assert (x[mx] == sorted(x[mx])).all()
assert (x[mx] == y[my]).all()
def random_unique_unsorted(n):
while True:
x = np.unique(np.random.randint(2*n, size=n))
if len(x):
break
np.random.shuffle(x)
return x
x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
check_my_intersect(x, y)
for i in range(20):
x = random_unique_unsorted(100+i)
y = random_unique_unsorted(200+i)
check_my_intersect(x, y)
Edit: "Note" comment was confusing (Used ...
as speech ellipsis, forgot it was a Python operator too).
You could also use np.searchsorted
, like so -
def searchsorted_based(x,y):
# Get argsort for both x and y
xsort_idx = x.argsort()
ysort_idx = y.argsort()
# Sort x and y and store them
X = x[xsort_idx]
Y = y[ysort_idx]
# Find positions of Y in X and the matches by the positions that
# shift between 'left' and 'right' based searches.
# Use the matches posotions to get corresponding argsort for X.
x1 = np.searchsorted(X,Y,'left')
x2 = np.searchsorted(X,Y,'right')
out1 = xsort_idx[x1[x2 != x1]]
# Repeat for X in Y findings
y1 = np.searchsorted(Y,X,'left')
y2 = np.searchsorted(Y,X,'right')
out2 = ysort_idx[y1[y2 != y1]]
return out1, out2
Sample run -
In [100]: x = np.array([4, 1, 10, 5, 8, 13, 11])
...: y = np.array([20, 5, 4, 9, 11, 7, 25])
...:
In [101]: searchsorted_based(x,y)
Out[101]: (array([0, 3, 6]), array([2, 1, 4]))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With