I need to do a few hundred million euclidean distance calculations every day in a Python project.
Here is what I started out with:
import numpy as np

def euclidean_dist_square(x, y):
    diff = np.array(x) - np.array(y)
    return np.dot(diff, diff)
This is quite fast, and I have already dropped the sqrt since I only need to rank items (nearest-neighbor search). It is still the bottleneck of the script, though. Therefore I have written a C extension that calculates the distance. The calculation is always done on 128-dimensional vectors.
#include "euclidean.h"
#include <math.h>
double euclidean(double x[128], double y[128])
{
double Sum;
for(int i=0;i<128;i++)
{
Sum = Sum + pow((x[i]-y[i]),2.0);
}
return Sum;
}
Complete code for the extension is here: https://gist.github.com/herrbuerger/bd63b73f3c5cf1cd51de
Now this gives a nice speedup in comparison to the numpy version.
But is there any way to speed this up further (this is my first C extension ever so I assume there is)? With the number of times this function is used every day, every microsecond would actually provide a benefit.
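To put a number on "every microsecond", a small timeit harness can measure the per-call cost. The sketch below times only the pure-NumPy version; the compiled extension's function can be dropped into the same harness once the module from the gist is built and importable.

import timeit

import numpy as np

def euclidean_dist_square(x, y):
    # the pure-NumPy version from above, repeated here so the snippet runs on its own
    diff = np.array(x) - np.array(y)
    return np.dot(diff, diff)

x = np.random.rand(128)
y = np.random.rand(128)

# average cost of one call in microseconds; swap in the C extension's function
# to time it the same way
n = 100000
per_call = timeit.timeit(lambda: euclidean_dist_square(x, y), number=n) / n
print('%.2f microseconds per call' % (per_call * 1e6))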
Some of you might suggest porting this completely from Python to another language, unfortunately this is a larger project and not an option :(
Thanks.
Edit
I have posted this question on CodeReview: https://codereview.stackexchange.com/questions/52218/possible-optimizations-for-calculating-squared-euclidean-distance
I will wait an hour before deleting this question, in case someone has already started writing an answer here.
The fastest way I know of to compute Euclidean distances in NumPy is the one in scikit-learn, which can be summed up as:
import numpy as np

def squared_distances(X, Y):
    """Return a matrix of squared distances between each pair of rows in X and Y."""
    # http://stackoverflow.com/a/19094808/166749
    X_row_norms = np.einsum('ij,ij->i', X, X)
    Y_row_norms = np.einsum('ij,ij->i', Y, Y)
    distances = np.dot(X, Y.T)
    distances *= -2
    distances += X_row_norms[:, np.newaxis]
    distances += Y_row_norms
    np.maximum(distances, 0, distances)  # get rid of small negatives from rounding; optional
    return distances
The bottleneck in this piece of code is matrix multiplication (np.dot), so make sure your NumPy is linked against a good BLAS implementation; with a multithreaded BLAS on a multicore machine and large enough input matrices, it should be faster than anything you can whip up in C. Note that it relies on the binomial formula
||x - y||² = ||x||² + ||y||² - 2 x⋅y
and that either X_row_norms or Y_row_norms can be cached across invocations for the k-NN use case.
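To make the caching concrete, here is a minimal sketch (my own illustration, not scikit-learn code) of single-query nearest-neighbor lookups against a fixed 128-dimensional database Y. Since ||x||² is the same for every candidate, it drops out of the ranking entirely:

import numpy as np

rng = np.random.RandomState(0)
Y = rng.rand(10000, 128)                     # the fixed database of vectors
Y_row_norms = np.einsum('ij,ij->i', Y, Y)    # computed once, reused for every query

def nearest_neighbor(x):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y; the ||x||^2 term is constant
    # across candidates, so argmin over the expression below gives the same result
    scores = Y_row_norms - 2.0 * np.dot(Y, x)
    return np.argmin(scores)

query = rng.rand(128)
print(nearest_neighbor(query))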
(I'm a coauthor of this code, and I spent quite some time optimizing both it and the SciPy implementation; scikit-learn is faster at the expense of some accuracy, but for k-NN that shouldn't matter too much. The SciPy implementation, available in scipy.spatial.distance, is actually an optimized version of the code you just wrote and is more accurate.)
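For completeness, the SciPy route can be used like this; cdist with the 'sqeuclidean' metric returns the same kind of squared-distance matrix (the shapes below are arbitrary, purely for illustration):

import numpy as np
from scipy.spatial.distance import cdist

X = np.random.rand(5, 128)
Y = np.random.rand(7, 128)

D = cdist(X, Y, 'sqeuclidean')   # squared Euclidean distances, shape (5, 7)
print(D.shape)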