Fastest pairwise distance metric in python

Tags: python, arrays, numpy, scipy, scikit-learn

I have a 1D array of numbers and want to calculate all pairwise Euclidean distances. I have a method (thanks to SO) of doing this with broadcasting, but it's inefficient because it calculates each distance twice, and it doesn't scale well.

Here's an example that gives me what I want with an array of 1000 numbers.

import numpy as np
import random
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
dists = np.abs(r - r[:, None])

What's the fastest implementation in scipy/numpy/scikit-learn that I can use to do this, given that it has to scale to cases where the 1D array has >10k values?

Note: the matrix is symmetric, so I'm guessing it's possible to get at least a 2x speedup by exploiting that; I just don't know how.
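To put rough numbers on the scaling concern, here is a minimal sketch that compares the size of the full matrix with the number of unique pairs. The helper name `pair_counts` is made up for illustration, and 8-byte float64 values are an assumption (an integer array like the one above may use a different item size).

```python
import numpy as np

def pair_counts(n):
    """Return (entries in the full n x n matrix, unique pairs with i < j)."""
    return n * n, n * (n - 1) // 2

n = 10_000
full, unique = pair_counts(n)
bytes_per_value = np.dtype(np.float64).itemsize  # assumption: float64 results

print(f"full matrix:  {full * bytes_per_value / 1e6:.0f} MB")
print(f"unique pairs: {unique * bytes_per_value / 1e6:.0f} MB")
```

For 10,000 values this comes to about 800 MB for the full matrix against about 400 MB for the unique pairs, so symmetry saves roughly half the memory.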

asked Nov 29 '13 03:11

roblanf

3 Answers

Neither of the other answers quite answered the question: one was in Cython, the other was slower. But both provided very useful hints. Following up on them suggests that scipy.spatial.distance.pdist is the way to go.

Here's some code:

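import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]  # column vector of shape (n, 1); pdist and sklearn need 2D input

def option1(r):
    # broadcasting on the 1D array: full n x n matrix
    return np.abs(r - r[:, None])

def option2(c):
    # condensed result: only the n*(n-1)/2 distances with i < j
    return scipy.spatial.distance.pdist(c, 'cityblock')

def option3(c):
    # full n x n matrix, computed by scikit-learn
    return sklearn.metrics.pairwise.manhattan_distances(c)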

Timing with IPython:

In [36]: timeit option1(r)
100 loops, best of 3: 5.31 ms per loop

In [37]: timeit option2(c)
1000 loops, best of 3: 1.84 ms per loop

In [38]: timeit option3(c)
100 loops, best of 3: 11.5 ms per loop
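Outside IPython, a comparison along these lines could be run with the standard timeit module. This is only a sketch: the helper `best_ms` and the repeat counts are illustrative choices, the scikit-learn variant is left out to keep it short, and the numbers will differ from machine to machine.

```python
import timeit
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
r = rng.integers(1, 1000, size=1000)
c = r[:, None]  # pdist expects a 2D array of shape (n, 1)

def best_ms(func, number=20, repeat=3):
    """Best per-call time in milliseconds over `repeat` runs of `number` calls."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number * 1e3

results = {
    "broadcast": best_ms(lambda: np.abs(r - r[:, None])),
    "pdist": best_ms(lambda: pdist(c, 'cityblock')),
}
for name, ms in results.items():
    print(f"{name}: {ms:.2f} ms per call")
```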

I didn't try the Cython implementation (I can't use it for this project), but comparing my results to the other answer that did, it looks like scipy.spatial.distance.pdist is roughly a third slower than the Cython implementation (taking into account the different machines by benchmarking on the np.abs solution).
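Note that pdist returns a condensed 1D array holding only the pairs with i < j, not the full matrix. A small sketch of how to work with it follows; the helper `condensed_index` is a hypothetical name, and scipy.spatial.distance.squareform is used to expand the result when the full symmetric matrix is really needed.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

r = np.array([3.0, 10.0, 4.0, 8.0])
condensed = pdist(r[:, None], 'cityblock')  # pairs (0,1), (0,2), (0,3), (1,2), ...
full = squareform(condensed)                # symmetric n x n matrix, zero diagonal

def condensed_index(i, j, n):
    """Position of the pair (i, j), with i < j, in pdist's condensed output."""
    return n * i - i * (i + 1) // 2 + (j - i - 1)

n = r.shape[0]
print(condensed[condensed_index(1, 3, n)])  # distance between r[1] and r[3]
```

Keeping the condensed form and indexing into it avoids ever allocating the n x n matrix, which is where the memory saving comes from.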

answered Oct 17 '22 01:10

roblanf

Here is a Cython implementation that gives a more than 3x speed improvement for this example on my computer. This timing should be re-checked for bigger arrays, though, because the BLAS routines can probably scale much better than this rather naive code.

I know you asked for something inside scipy/numpy/scikit-learn, but maybe this will open new possibilities for you:

File my_cython.pyx:

import numpy as np
cimport numpy as np
import cython

# fabs is the double-precision absolute value; C's abs() takes an int
cdef extern from "math.h":
    double fabs(double t)

@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance(np.ndarray[np.double_t, ndim=1] r):
    cdef Py_ssize_t i, j, c, n, size
    cdef np.ndarray[np.double_t, ndim=1] ans
    n = r.shape[0]
    # number of pairs with i <= j (the zero diagonal is included)
    size = n * (n + 1) // 2
    ans = np.empty(size, dtype=r.dtype)
    c = -1
    for i in range(n):
        for j in range(i, n):
            c += 1
            ans[c] = fabs(r[i] - r[j])
    return ans

The result is a 1-D array containing each pair once (including the zero distance of each element to itself), so it has n*(n+1)/2 entries.
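To check the layout of that output without compiling anything, a pure-NumPy reference can be written with np.triu_indices. This is a sketch for verification only, not a fast replacement; the function name `pairwise_distance_reference` is made up here.

```python
import numpy as np

def pairwise_distance_reference(r):
    """Same ordering as the nested loop: all pairs (i, j) with i <= j, row by row."""
    i, j = np.triu_indices(r.shape[0])  # default k=0 keeps the diagonal
    return np.abs(r[i] - r[j])

r = np.array([3.0, 10.0, 4.0])
out = pairwise_distance_reference(r)
print(out)  # pairs (0,0), (0,1), (0,2), (1,1), (1,2), (2,2)
```

Passing k=1 to np.triu_indices would drop the diagonal zeros and give the n*(n-1)/2 layout instead.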

To import into Python:

import numpy as np
import random

import pyximport; pyximport.install()
from my_cython import pairwise_distance

r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)], dtype=float)

def solOP(r):
    return np.abs(r - r[:, None])

Timing with IPython:

In [2]: timeit solOP(r)
100 loops, best of 3: 7.38 ms per loop

In [3]: timeit pairwise_distance(r)
1000 loops, best of 3: 1.77 ms per loop

answered Oct 17 '22 01:10

Saullo G. P. Castro

Using half the memory, but 6 times slower than np.abs(r - r[:, None]):

triu = np.triu_indices(r.shape[0], 1)
dists2 = abs(r[triu[1]] - r[triu[0]])
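As a sanity check, the upper-triangle result above should come out in the same order as the condensed output of scipy.spatial.distance.pdist. The sketch below compares the two on a small random array; the array size and seed are arbitrary choices. Keep in mind that np.triu_indices itself allocates two integer index arrays of the same length as the result.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
r = rng.integers(1, 1000, size=200).astype(float)

rows, cols = np.triu_indices(r.shape[0], 1)  # k=1 skips the diagonal
dists2 = np.abs(r[cols] - r[rows])

condensed = pdist(r[:, None], 'cityblock')
same = np.allclose(dists2, condensed)
print(same, dists2.shape)
```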

answered Oct 17 '22 03:10

cyborg
