Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Indices that intersect and sort two numpy arrays

I have two numpy arrays of integers, both of length several hundred million. Within each array values are unique, and each is initially unsorted.

I would like the indices to each that yield their sorted intersection. For example:

x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

Then the sorted intersection of these is [4, 5, 11], and so we want the indices that turn each of x and y into that array, so we want it to return:

mx = np.array([0, 3, 6])
my = np.array([2, 1, 4])

since then x[mx] == y[my] == np.intersect1d(x, y)

The only solution we have so far involves three different argsorts, so it seems that is unlikely to be optimal.

Each value represents a galaxy, in case that makes the problem more fun.

like image 676
JoeZuntz Avatar asked Nov 04 '15 18:11

JoeZuntz


People also ask

How do you find the intersection of two NumPy arrays?

Step 1: Import numpy. Step 2: Define two numpy arrays. Step 3: Find intersection between the arrays using the numpy. intersect1d() function.

How do you get the positions where elements of two arrays match in NumPy?

import numpy as np A = np. array([[1, 1], [2, 2]]) B = np. array([[1, 1], [2, 2]]) print(A == B) In this resulting matrix, each element is a result of a comparison of two corresponding elements in the two arrays.

Do NumPy arrays have indices?

You can access an array element by referring to its index number. The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the second has index 1 etc.


2 Answers

Here's an option based on intersect1d's implementation, which is fairly straightforward. It requires one call to argsort.

The admittedly simplistic test passes.

import numpy as np


def my_intersect(x, y):
    """my_intersect(x, y) -> xm, ym
    x, y: 1-d arrays of unique values
    xm, ym: indices into x and y giving sorted intersection
    """
    # basic idea taken from numpy.lib.arraysetops.intersect1d
    aux = np.concatenate((x, y))
    sidx = aux.argsort()
    # Note: intersect1d uses aux[:-1][aux[1:]==aux[:-1]] here - I don't know why the first [:-1] is necessary
    inidx = aux[sidx[1:]] == aux[sidx[:-1]]

    # quicksort is not stable, so must do some work to extract indices
    # (if stable, sidx[inidx.nonzero()]  would be for x)
    # interlace the two sets of indices, and check against lengths
    xym = np.vstack((sidx[inidx.nonzero()],
                     sidx[1:][inidx.nonzero()])).T.flatten()

    xm = xym[xym < len(x)]
    ym = xym[xym >= len(x)] - len(x)

    return xm, ym


def check_my_intersect(x, y):
    mx, my = my_intersect(x, y)
    assert (x[mx] == np.intersect1d(x, y)).all()

    # not really necessary: np.intersect1d returns a sorted list
    assert (x[mx] == sorted(x[mx])).all()
    assert (x[mx] == y[my]).all()


def random_unique_unsorted(n):
    while True:
        x = np.unique(np.random.randint(2*n, size=n))
        if len(x):
            break
    np.random.shuffle(x)
    return x


x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

check_my_intersect(x, y)


for i in range(20):
    x = random_unique_unsorted(100+i)
    y = random_unique_unsorted(200+i)
    check_my_intersect(x, y)

Edit: "Note" comment was confusing (Used ... as speech ellipsis, forgot it was a Python operator too).

like image 69
Rory Yorke Avatar answered Sep 21 '22 14:09

Rory Yorke


You could also use np.searchsorted, like so -

def searchsorted_based(x,y):

    # Get argsort for both x and y
    xsort_idx = x.argsort()
    ysort_idx = y.argsort()

    # Sort x and y and store them
    X = x[xsort_idx]
    Y = y[ysort_idx]

    # Find positions of Y in X and the matches by the positions that 
    # shift between 'left' and 'right' based searches. 
    # Use the matches posotions to get corresponding argsort for X.
    x1 = np.searchsorted(X,Y,'left') 
    x2 = np.searchsorted(X,Y,'right') 
    out1 = xsort_idx[x1[x2 != x1]]

    # Repeat for X in Y findings
    y1 = np.searchsorted(Y,X,'left') 
    y2 = np.searchsorted(Y,X,'right')
    out2 = ysort_idx[y1[y2 != y1]]

    return out1, out2

Sample run -

In [100]: x = np.array([4, 1, 10, 5, 8, 13, 11])
     ...: y = np.array([20, 5, 4, 9, 11, 7, 25])
     ...: 

In [101]: searchsorted_based(x,y)
Out[101]: (array([0, 3, 6]), array([2, 1, 4]))
like image 36
Divakar Avatar answered Sep 20 '22 14:09

Divakar