Efficient calculation of Euclidean distance

I have an M x N array, where M is the number of observations and N is the dimensionality of each vector. From this array of vectors, I need to calculate the mean and minimum Euclidean distance between the vectors.

In my mind, this requires me to calculate C(M, 2) = M(M-1)/2 distances, which is O(M^2) in the number of observations (O(M^2 N) arithmetic overall). My M is ~10,000 and my N is ~1,000, and this computation takes ~45 seconds.

Is there a more efficient way to compute the mean and min distances? Perhaps a probabilistic method? I don't need it to be exact, just close.

asked Mar 19 '17 03:03

japata

2 Answers

You didn't describe where your vectors come from, nor what use you will put the mean and minimum to. Here are some observations about the general case. Limited ranges, error tolerance, and discrete values may admit a more efficient approach.

The mean distance between M points sounds quadratic, O(M^2). But M / N is 10, fairly small, and N is huge, so the data probably resembles a hairy sphere in 1e3-space. Computing the centroid of the M points, and then computing the M distances to that centroid, costs only O(M N) and might turn out to be useful in your problem domain; it's hard to tell.
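To make the centroid idea concrete, here is a minimal NumPy sketch. The function name `centroid_summary` is made up for illustration. It returns the M distances to the centroid, plus one quantity that can be had exactly in O(M N): the mean *squared* pairwise distance, via the identity sum over i<j of ||x_i - x_j||^2 = M * sum over i of ||x_i - c||^2. Note that this is not the mean distance itself.

```python
import numpy as np

def centroid_summary(X):
    """Return (distances to centroid, exact mean squared pairwise distance)."""
    X = np.asarray(X, dtype=float)
    M = X.shape[0]
    centroid = X.mean(axis=0)                            # O(M*N)
    to_centroid = np.linalg.norm(X - centroid, axis=1)   # M distances
    # Identity: sum_{i<j} ||x_i - x_j||^2 = M * sum_i ||x_i - centroid||^2,
    # divided by the M*(M-1)/2 pairs.
    mean_sq_pairwise = 2.0 * np.sum(to_centroid ** 2) / (M - 1)
    return to_centroid, mean_sq_pairwise
```

By Jensen's inequality, the square root of `mean_sq_pairwise` is an upper bound on the mean pairwise distance. Whether that bound is tight enough depends on how spread out your distances are.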

The minimum distance among M points is more interesting. Choose a small number of pairs at random, say 100, compute their distances, and take half the minimum as an estimate of the global minimum distance. (Validate by comparing to the next few smallest distances, if desired.) Now use a spatial UB-tree to model each point as a positive integer. This involves finding the N per-dimension minima of the M x N values, adding constants so each minimum becomes zero, scaling so the estimated global min distance corresponds to at least 1.0, and then truncating to integer.
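A rough sketch of the sampling and discretization steps, assuming the data sits in a NumPy array. The names `sample_pair_distances` and `quantize` are hypothetical, and the choice of 100 pairs is just the figure mentioned above.

```python
import numpy as np

def sample_pair_distances(X, n_pairs=100, seed=0):
    """Distances for n_pairs random pairs (i, j) with i != j."""
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    i = rng.integers(0, M, size=n_pairs)
    j = rng.integers(0, M - 1, size=n_pairs)
    j = j + (j >= i)                      # skip over i so that j != i
    return np.linalg.norm(X[i] - X[j], axis=1)

def quantize(X, min_dist_estimate):
    """Shift each dimension to start at zero, scale, truncate to integers."""
    shifted = X - X.min(axis=0)           # N per-dimension minima become zero
    scaled = shifted / min_dist_estimate  # the estimate now corresponds to 1.0
    return np.floor(scaled).astype(np.int64)
```

Usage would look like `d = sample_pair_distances(X)` followed by `Q = quantize(X, 0.5 * d.min())`. As a side effect, `d.mean()` is a cheap estimate of the mean pairwise distance, which may already be close enough for the "mean" half of the question.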

With these transformed vectors in hand, we're ready to turn them into a UB-tree representation that we can sort, and then do nearest neighbor spatial queries on the sorted values. For each point compute an integer. Shift the low-order bit of each dimension's value into the result, then iterate. Continue iterating over all dimensions until non-zero bits have all been consumed and appear in the result, and proceed to the next point. Numerically sort the integer result values, yielding a data structure similar to a PostGIS index.
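The bit interleaving described here is the Z-order (Morton) key that a UB-tree is built on. A small pure-Python sketch follows; `interleave_bits` and `sort_by_key` are illustrative names, and the dimension order within each round of bits is an arbitrary choice. Python integers are unbounded, so a key spanning 1,000 dimensions is not a problem.

```python
def interleave_bits(coords):
    """Interleave the bits of non-negative integers, low-order bits first."""
    coords = [int(c) for c in coords]     # also accepts NumPy integer scalars
    nbits = max((c.bit_length() for c in coords), default=0)
    key = 0
    pos = 0
    for bit in range(nbits):              # one round per bit position
        for c in coords:                  # one output bit per dimension
            key |= ((c >> bit) & 1) << pos
            pos += 1
    return key

def sort_by_key(points):
    """Return point indices ordered by their interleaved key."""
    return sorted(range(len(points)), key=lambda i: interleave_bits(points[i]))
```

For example, `interleave_bits([3, 3])` gives 15, since both bits of both coordinates are set.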

Now you have a discretized representation that supports reasonably efficient queries for nearest neighbors (though admittedly N=1e3 is inconveniently large). After finding two or more coarse-grained nearby neighbors, you can query the original vector representation to obtain high-resolution distances between them, for finer discrimination. If your data distribution turns out to have a large fraction of points that discretize to being off by a single bit from their nearest neighbor, e.g. locations of oxygen atoms where each has a buddy, then increase the global min distance estimate so the low-order bits offer adequate discrimination.
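One simple way to do the refinement step, sketched under the assumption that `order` is a list of point indices already sorted by the discretized key: compare each point only against the next few entries in that order, using the original vectors. The name `refine_min_distance` and the `window` parameter are made up here. This is a heuristic, so the result is an upper bound on the true minimum rather than a guarantee.

```python
import numpy as np

def refine_min_distance(X, order, window=4):
    """Exact distances between points that are close in the sorted order."""
    X = np.asarray(X, dtype=float)
    best = np.inf
    best_pair = None
    for k, i in enumerate(order):
        # Only look at the next `window` points in key order.
        for j in order[k + 1 : k + 1 + window]:
            d = np.linalg.norm(X[i] - X[j])
            if d < best:
                best, best_pair = d, (i, j)
    return best, best_pair
```

A larger `window` trades time for a better chance of catching the true closest pair; with `window=len(order)` it degenerates to the full quadratic scan.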

A similar discretization approach would be appropriately scaling e.g. 2-dimensional inputs and marking an initially empty grid, then scanning immediate neighborhoods. This relies on global min being within a "small" neighborhood, due to appropriate scaling. In your case you would be marking an N-dimensional grid.
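Here is what the 2-dimensional version might look like, using a dictionary as a sparse grid. The function name `grid_min_distance` and the `cell` parameter are illustrative. The result is exact only when `cell` is at least as large as the true minimum distance, because only then is the closest pair guaranteed to land in the same or adjacent cells.

```python
import math
from collections import defaultdict
from itertools import product

def grid_min_distance(points, cell):
    """Minimum pairwise distance among 2-D points, scanning adjacent cells only."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(math.floor(x / cell), math.floor(y / cell))].append(idx)

    best = math.inf
    for (cx, cy), members in grid.items():
        # Immediate neighborhood: the cell itself plus its 8 neighbors.
        for dx, dy in product((-1, 0, 1), repeat=2):
            for i in members:
                for j in grid.get((cx + dx, cy + dy), ()):
                    if i < j:             # count each pair once
                        d = math.hypot(points[i][0] - points[j][0],
                                       points[i][1] - points[j][1])
                        best = min(best, d)
    return best
```

In N dimensions the immediate neighborhood has 3^N cells, so for N=1e3 a literal translation of this scan is not practical; that is where the sorted-key representation earns its keep.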

answered Oct 07 '22 06:10

J_H

You may be able to speed things up with some sort of space partitioning.

For the minimum distance calculation, you would only need to consider pairs of points in the same or neighbouring partitions. For an approximate mean, you might be able to come up with some sort of weighted average based on the distances between partitions and the number of points within them.
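As one possible reading of the weighted-average idea, the sketch below assigns each point to the nearest of a few randomly chosen seed points (a hypothetical partitioning scheme, not the only option), then averages centroid-to-centroid distances weighted by the product of partition sizes. The name `approx_mean_distance` is made up, and pairs inside the same partition are ignored, so treat the output as a rough estimate only.

```python
import numpy as np

def approx_mean_distance(X, n_parts=32, seed=0):
    """Approximate mean pairwise distance from partition centroids (n_parts >= 2)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    M = X.shape[0]
    # Assumed partitioning: nearest of n_parts randomly chosen seed points.
    seeds = X[rng.choice(M, size=n_parts, replace=False)]
    d2 = (X ** 2).sum(axis=1)[:, None] - 2.0 * X @ seeds.T + (seeds ** 2).sum(axis=1)[None, :]
    labels = d2.argmin(axis=1)
    counts = np.bincount(labels, minlength=n_parts)
    used = np.flatnonzero(counts)                     # drop empty partitions
    centroids = np.array([X[labels == p].mean(axis=0) for p in used])
    counts = counts[used]
    # Centroid-to-centroid distances, weighted by the number of cross pairs.
    cd = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    weights = np.outer(counts, counts)
    iu = np.triu_indices(len(counts), k=1)
    return (weights[iu] * cd[iu]).sum() / weights[iu].sum()
```

With more partitions the estimate moves toward the exact mean, at the cost of more centroid pairs to compare. The within-partition pairs could be added as a correction term if the partitions turn out to be large.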

answered Oct 07 '22 07:10

Martin Stone


Tags: python, algorithm, python-3.x, euclidean-distance
