Efficient Python implementation of numpy array comparisons

Background

I have two numpy arrays which I'd like to use to carry out some comparison operations in the most efficient/fast way possible. Both contain only unsigned ints.

pairs is an n x 2 x 3 array which holds a long list of paired 3D coordinates (in the nomenclature used here, the pairs array contains a set of pairs) - i.e.

# full pairs array
In [145]: pairs
Out[145]:
    array([[[1, 2, 4],
        [3, 4, 4]],
        .....
       [[1, 2, 5],
        [5, 6, 5]]])

# each entry contains a pair of 3D coordinates
In [149]: pairs[0]
Out[149]:
array([[1, 2, 4],
       [3, 4, 4]])

positions is an n x 3 array which holds a set of 3D coordinates

In [162]: positions
Out[162]:
array([[ 1,  2,  4],
       [ 3,  4,  5],
       [ 5,  6,  3],
       [ 3,  5,  6],
       [ 6,  7,  5],
       [12,  2,  5]])

Goal

I want to create an array which is a subset of the pairs array, but contains ONLY entries where at most one member of the pair is in the positions array - i.e. there should be no pairs where BOTH members lie in the positions array. For some domain information, every pair will have at least one of its two positions inside the positions list.

Approaches tried so far

My initial naive approach was to loop over each pair in the pairs array, subtract each of the two pair positions from the positions array, and determine whether BOTH subtractions produced a match, indicated by an all-zero row in each of the resulting arrays:

 if ((~(positions-pair[0]).any(axis=1)).any() and
         (~(positions-pair[1]).any(axis=1)).any()):
     # both members of the pair were in the positions array -
     # these weren't the droids we were looking for
     pass
 else:
     # append this pair to a new matrix (new_pairs is the accumulator)
     new_pairs.append(pair)

This works fine, and takes advantage of some vectorization, but I suspect there is a better way to do this.
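For reference, here is a minimal self-contained version of that loop (the sample arrays and the `kept`/`result` names are illustrative, not from the original post):

```python
import numpy as np

# Small illustrative arrays matching the shapes described above.
pairs = np.array([[[1, 2, 4], [3, 4, 4]],
                  [[1, 2, 5], [5, 6, 5]]])
positions = np.array([[1, 2, 4], [3, 4, 5], [5, 6, 3],
                      [3, 5, 6], [6, 7, 5], [12, 2, 5]])

kept = []
for pair in pairs:
    # An all-zero row in (positions - p) means p occurs in positions.
    in0 = (~(positions - pair[0]).any(axis=1)).any()
    in1 = (~(positions - pair[1]).any(axis=1)).any()
    if not (in0 and in1):          # keep unless BOTH members matched
        kept.append(pair)
result = np.array(kept)
```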

For some other performance-sensitive portions of this program I've re-written things in Cython, which brought a massive speedup, though in this case (at least for a naive nested for-loop implementation) Cython was slightly slower than the approach outlined above.

If people have suggestions I'm happy to profile and report back (I have all the profiling infrastructure set up).

asked Aug 08 '15 by Alex


2 Answers

Approach #1

As mentioned in the question, both arrays contain only unsigned ints, which can be exploited to merge each XYZ triplet into an equivalent linear index that is unique for each unique triplet. The implementation would look something like this -

import numpy as np

maxlen = np.max(pairs,axis=(0,1))
dims = np.append(maxlen[::-1][:-1].cumprod()[::-1],1)

pairs1D = np.dot(pairs.reshape(-1,3),dims)
positions1D = np.dot(positions,dims)
mask_idx = ~(np.in1d(pairs1D,positions1D).reshape(-1,2).all(1))
out = pairs[mask_idx]
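One caveat with the linearization above: maxlen is computed from pairs alone, so if positions contains a coordinate larger than any in pairs, two distinct triplets could in principle map to the same linear index. A slightly more defensive sketch (my variation, not from the original answer) takes the per-axis radix over both arrays and adds 1:

```python
import numpy as np

pairs = np.array([[[1, 2, 4], [3, 4, 4]],
                  [[1, 2, 5], [5, 6, 5]]])
positions = np.array([[1, 2, 4], [3, 4, 5], [5, 6, 3],
                      [3, 5, 6], [6, 7, 5], [12, 2, 5]])

# Radix per axis taken over BOTH arrays, plus 1, so every triplet
# maps to a distinct linear index with no risk of collisions.
maxlen = np.maximum(pairs.max(axis=(0, 1)), positions.max(axis=0)) + 1
dims = np.append(maxlen[::-1][:-1].cumprod()[::-1], 1)

pairs1D = pairs.reshape(-1, 3).dot(dims)
positions1D = positions.dot(dims)
mask = ~np.in1d(pairs1D, positions1D).reshape(-1, 2).all(axis=1)
out = pairs[mask]
```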

Since you are dealing with 3D coordinates, you can also use cdist to check for identical XYZ triplets between the input arrays. Listed next are two implementations with that idea in mind.

Approach #2

from scipy.spatial.distance import cdist

p0 = cdist(pairs[:,0,:],positions)
p1 = cdist(pairs[:,1,:],positions)
out = pairs[((p0==0) | (p1==0)).sum(1)!=2]

Approach #3

mask_idx = ~((cdist(pairs.reshape(-1,3),positions)==0).any(1).reshape(-1,2).all(1))
out = pairs[mask_idx]

Runtime tests -

In [80]: n = 5000
    ...: pairs = np.random.randint(0,100,(n,2,3))
    ...: positions= np.random.randint(0,100,(n,3))
    ...: 

In [81]: def cdist_split(pairs,positions):
    ...:    p0 = cdist(pairs[:,0,:],positions)
    ...:    p1 = cdist(pairs[:,1,:],positions)
    ...:    return pairs[((p0==0) | (p1==0)).sum(1)!=2]
    ...: 
    ...: def cdist_merged(pairs,positions):
    ...:    mask_idx = ~((cdist(pairs.reshape(-1,3),positions)==0).any(1).reshape(-1,2).all(1))
    ...:    return pairs[mask_idx]
    ...: 
    ...: def XYZ_merged(pairs,positions):
    ...:    maxlen = np.max(pairs,axis=(0,1))
    ...:    dims = np.append(maxlen[::-1][:-1].cumprod()[::-1],1)
    ...:    pairs1D = np.dot(pairs.reshape(-1,3),dims)
    ...:    positions1D = np.dot(positions,dims)
    ...:    mask_idx1 = ~(np.in1d(pairs1D,positions1D).reshape(-1,2).all(1))
    ...:    return pairs[mask_idx1]
    ...: 

In [82]: %timeit cdist_split(pairs,positions)
1 loops, best of 3: 662 ms per loop

In [83]: %timeit cdist_merged(pairs,positions)
1 loops, best of 3: 615 ms per loop

In [84]: %timeit XYZ_merged(pairs,positions)
100 loops, best of 3: 4.02 ms per loop

Verify results -

In [85]: np.allclose(cdist_split(pairs,positions),cdist_merged(pairs,positions))
Out[85]: True

In [86]: np.allclose(cdist_split(pairs,positions),XYZ_merged(pairs,positions))
Out[86]: True
answered Sep 28 '22 by Divakar


Elaborating on my comment:

Expand pairs to be more interesting. Feel free to test with a larger, more realistic array:

In [260]: pairs = np.array([[[1,2,4],[3,4,4]],[[1,2,5],[5,6,5]],[[3,4,5],[3,5,6]],[[6,7,5],[1,2,3]]])

In [261]: positions = np.array([[ 1,  2,  4],
       [ 3,  4,  5],
       [ 5,  6,  3],
       [ 3,  5,  6],
       [ 6,  7,  5],
       [12,  2,  5]])

Expand both arrays into broadcastable shapes:

In [262]: I = pairs[None,...]==positions[:,None,None,:]

In [263]: I.shape
Out[263]: (6, 4, 2, 3)

This is a large boolean array, showing element-by-element matches on all dimensions. Feel free to substitute other comparisons (difference == 0, np.isclose for floats, etc.).

In [264]: J = I.all(axis=-1).any(axis=0).sum(axis=-1)

In [265]: J
Out[265]: array([1, 0, 2, 1])

This consolidates results over the various dimensions: all over the coordinates, any over the positions, and sum over the two members of each pair.

In [266]: pairs[J==1,...]
Out[266]: 
array([[[1, 2, 4],
        [3, 4, 4]],

       [[6, 7, 5],
        [1, 2, 3]]])

J==1 represents pairs where only one member of the pair matches (see the end note).

The combination of all, any, and sum works for this test case, but might need adjustment with larger test cases. The idea, though, is generally applicable.


For arrays of the size that https://stackoverflow.com/a/31901675/901925 tests, my solution is quite slow. In particular, the == test produces I with shape (5000, 5000, 2, 3).

Compressing the last dimension helps a lot:

dims = np.array([10000,100,1])  # simpler version of dims from XYZ_merged
pairs1D = np.dot(pairs.reshape(-1,3),dims)
positions1D = np.dot(positions,dims)
I1d = pairs1D[:,None]==positions1D[None,:]
J1d = I1d.any(axis=-1).reshape(pairs.shape[:2]).sum(axis=-1)

I changed the J1d expression to match mine - to count the number of matches per pair.

The in1d approach that Divakar uses is even faster:

mask = np.in1d(pairs1D, positions1D).reshape(-1,2)
Jmask = mask.sum(axis=-1)

I just realized that the OP is asking for at most one of the pair's members to be in the positions array, whereas I'm testing for exactly one match per pair. So my tests should all be changed to pairs[J<2,...].

(In my particular random sample for n=5000, that turns out to be everything: there aren't any pairs where both members are found in positions. 54 of the 5000 have J==1; the rest are 0, no match.)
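Putting that correction together, a self-contained sketch of the broadcasted test with the at-most-one condition (J < 2), using the expanded sample arrays from above:

```python
import numpy as np

pairs = np.array([[[1, 2, 4], [3, 4, 4]],
                  [[1, 2, 5], [5, 6, 5]],
                  [[3, 4, 5], [3, 5, 6]],
                  [[6, 7, 5], [1, 2, 3]]])
positions = np.array([[1, 2, 4], [3, 4, 5], [5, 6, 3],
                      [3, 5, 6], [6, 7, 5], [12, 2, 5]])

# Element-wise equality, broadcast to shape (positions, pairs, 2, 3).
I = pairs[None, ...] == positions[:, None, None, :]
# all over coordinates, any over positions, sum over the two members:
# J[k] counts how many members of pair k appear in positions.
J = I.all(axis=-1).any(axis=0).sum(axis=-1)
out = pairs[J < 2]   # keep pairs where at most one member matched
```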

answered Sep 28 '22 by hpaulj