Fast distance calculation in scipy and numpy

Question

Let A,B be ((day,observation,dim)) arrays. Each array contains for a given day the same number of observations, an observation being a point with dim dimensions (that is dim floats). For every day, I want to compute the spatial distances between all observations in A and B that day.

For example:

import numpy as np
from scipy.spatial.distance import cdist

A, B = np.random.rand(50,1000,10), np.random.rand(50,1000,10)

output = []
for day in range(50):
    output.append(cdist(A[day],B[day]))

where I use scipy.spatial.distance.cdist.

Is there a faster way to do this? Ideally, I would like to get for output a ((day,observation,observation)) array that contains for every day the pairwise distances between observations in A and B that day, whilst somehow avoid the loop over days.

EelkeSpaak · Accepted Answer

One way to do it (though it will require a massive amount of memory) is to make clever use of array broadcasting:

output = np.sqrt( np.sum( (A[:,:,np.newaxis,:] - B[:,np.newaxis,:,:])**2, axis=-1) )

Edit

But after some testing, it seems that probably scikit-learn's euclidean_distances is the best option for large arrays. (Note that I've rewritten your loop into a list comprehension.)

This is for 100 data points per day:

# your own code using cdist
from scipy.spatial.distance import cdist
%timeit dists1 = np.asarray([cdist(x,y) for x, y in zip(A, B)])

100 loops, best of 3: 8.81 ms per loop

# pure numpy with broadcasting
%timeit dists2 = np.sqrt( np.sum( (A[:,:,np.newaxis,:] - B[:,np.newaxis,:,:])**2, axis=-1) )

10 loops, best of 3: 46.9 ms per loop

# scikit-learn's algorithm
from sklearn.metrics.pairwise import euclidean_distances
%timeit dists3 = np.asarray([euclidean_distances(x,y) for x, y in zip(A, B)])
100 loops, best of 3: 12.6 ms per loop

and this is for 2000 data points per day:

In [5]: %timeit dists1 = np.asarray([cdist(x,y) for x, y in zip(A, B)])
1 loops, best of 3: 3.07 s per loop

In [7]: %timeit dists3 = np.asarray([euclidean_distances(x,y) for x, y in zip(A, B)])

1 loops, best of 3: 2.94 s per loop

or1426 · Answer

Edit: I'm an idiot and forgot that python's map is evaluated lazily. My "faster" code wasn't actually doing any of the work! Forcing evaluation removed the performance boost.

I think your time is going to be dominated by the time spent inside the scipy function. I'd use map instead of the loop anyway as I think its a bit neater but I don't think theres any magic way to get a huge performance boost here. Maybe compiling the code with cython or using numba would help a little.

Fast distance calculation in scipy and numpy

Tags:

performance

python

multidimensional-array

numpy

scipy

fact

2 Answers

EelkeSpaak

or1426

Recent Activity

Donate For Us

Fast distance calculation in scipy and numpy

Tags:

performance

python

multidimensional-array

numpy

scipy

fact

2 Answers

EelkeSpaak

or1426

Related questions

Recent Activity

Donate For Us