Using Numpy to find the average distance in a set of points

Question

I have an array of points in unknown dimensional space, such as:

data=numpy.array(
[[ 115, 241, 314],
[ 153, 413, 144],
[ 535, 2986, 41445]])

and I would like to find the average euclidean distance between all points.

Please note that I have over 20,000 points, so I would like to do this as efficiently as possible.

Thanks.

Justin Peel · Accepted Answer

Well, I don't think that there is a super fast way to do this, but this should do it:

tot = 0.

for i in xrange(data.shape[0]-1):
    tot += ((((data[i+1:]-data[i])**2).sum(1))**.5).sum()

avg = tot/((data.shape[0]-1)*(data.shape[0])/2.)

Nick · Answer

If you have access to scipy, you could try the following:

scipy.spatial.distance.cdist(data,data)

Michael Aaron Safyan · Answer

Now that you've stated your goal of finding the outliers, you are probably better off computing the sample mean and, with that, the sample variance, since both those operations will give you an O(nd) operation. With that, you should be able to find outliers (e.g. excluding points further from the mean than some fraction of the std. dev.), and that filtering process should be possible to perform in O(nd) time for a total of O(nd).

You might be interested in a refresher on Chebyshev's inequality.

doug · Answer

Is it ever worthwhile to optimize without a working solution? Also, computation of a distance matrix over the entire data set rarely needs to be fast because you only do it once--when you need to know a distance between two points, you just look it up, it's already calculated.

So if you don't have a place to start, here's one. If you want to do this in Numpy without the need to write any inline fortran or C, that should be no problem, though perhaps you want to include this small vector-based virtual machine called "numexpr" (available on PyPI, trivial to intall) which in this case gave a 5x performance boost versus Numpy alone.

Below i've calculated a distance matrix for 10,000 points in 2D space (a 10K x 10k matrix giving the distance between all 10k points). This took 59 seconds on my MBP.

import numpy as NP
import numexpr as NE

# data are points in 2D space (x, y)--obviously, this code can accept data of any dimension
x = NP.random.randint(0, 10, 10000)
y = NP.random.randint(0, 10, 10000)
fnx = lambda q : q - NP.reshape(q, (len(q), 1))
delX = fnx(x)
delY = fnx(y)
dist_mat = NE.evaluate("(delX**2 + delY**2)**0.5")

Using Numpy to find the average distance in a set of points

Tags:

performance

python

algorithm

numpy

distance

Mel Kaye

4 Answers

Justin Peel

Nick

Michael Aaron Safyan

doug

Recent Activity

Donate For Us

Using Numpy to find the average distance in a set of points

Tags:

performance

python

algorithm

numpy

distance

Mel Kaye

4 Answers

Justin Peel

Nick

Michael Aaron Safyan

doug

Related questions

Recent Activity

Donate For Us