I have an array 3xN of 3d coordinates and I would like to efficiently calculate a distance matrix of all entries. Is there any efficient loop strategy rather than the nested loop one could apply?
Current pseudocode implementation:
for i,coord in enumerate(coords):
for j,coords2 in enumerate(coords):
if i != j:
dist[i,j] = numpy.norm(coord - coord2)
Because the Numpy array is densely packed in memory due to its homogeneous type, it also frees the memory faster. So overall a task executed in Numpy is around 5 to 100 times faster than the standard python list, which is a significant leap in terms of speed.
Let's run the program and see what we get. The output may look like the one below. The NumPy version is faster. It took roughly one-hundredth of the time for-loops took.
To limit the values of the NumPy array ndarray to given range, use np. clip() or clip() method of ndarray . By specifying the minimum and maximum values in the argument, the out-of-range values are replaced with those values. This is useful when you want to limit the values to a range such as 0.0 ~ 1.0 or 0 ~ 255 .
Clip (limit) the values in an array. Given an interval, values outside the interval are clipped to the interval edges. For example, if an interval of [0, 1] is specified, values smaller than 0 become 0, and values larger than 1 become 1.
To reproduce your results exactly:
>>> import scipy.spatial as sp
>>> import numpy as np
>>> a=np.random.rand(5,3) #Note this is the transpose of your array.
>>> a
array([[ 0.83921304, 0.72659404, 0.50434178], #0
[ 0.99883826, 0.91739731, 0.9435401 ], #1
[ 0.94327962, 0.57665875, 0.85853404], #2
[ 0.30053567, 0.44458829, 0.35677649], #3
[ 0.01345765, 0.49247883, 0.11496977]]) #4
>>> sp.distance.cdist(a,a)
array([[ 0. , 0.50475862, 0.39845025, 0.62568048, 0.94249268],
[ 0.50475862, 0. , 0.35554966, 1.02735895, 1.35575051],
[ 0.39845025, 0.35554966, 0. , 0.82602847, 1.1935422 ],
[ 0.62568048, 1.02735895, 0.82602847, 0. , 0.3783884 ],
[ 0.94249268, 1.35575051, 1.1935422 , 0.3783884 , 0. ]])
To do it more efficiently without duplicating calculations and only calculate unique pairs:
>>> sp.distance.pdist(a)
array([ 0.50475862, 0.39845025, 0.62568048, 0.94249268, 0.35554966,
1.02735895, 1.35575051, 0.82602847, 1.1935422 , 0.3783884 ])
#pairs: [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3),
# (2, 4), (3, 4)]
Note the relationship between the two arrays. The cdist
array can be reproduced by:
>>> out=np.zeros((a.shape[0],a.shape[0]))
>>> dists=sp.distance.pdist(a)
>>> out[np.triu_indices(a.shape[0],1)]=dists
>>> out+=out.T
>>> out
array([[ 0. , 0.50475862, 0.39845025, 0.62568048, 0.94249268],
[ 0.50475862, 0. , 0.35554966, 1.02735895, 1.35575051],
[ 0.39845025, 0.35554966, 0. , 0.82602847, 1.1935422 ],
[ 0.62568048, 1.02735895, 0.82602847, 0. , 0.3783884 ],
[ 0.94249268, 1.35575051, 1.1935422 , 0.3783884 , 0. ]])
Some somewhat surprising timings-
The setup:
def pdist_toarray(a):
out=np.zeros((a.shape[0],a.shape[0]))
dists=sp.distance.pdist(a)
out[np.triu_indices(a.shape[0],1)]=dists
return out+out.T
def looping(a):
out=np.zeros((a.shape[0],a.shape[0]))
for i in xrange(a.shape[0]):
for j in xrange(a.shape[0]):
out[i,j]=np.linalg.norm(a[i]-a[j])
return out
Timings:
arr=np.random.rand(1000,3)
%timeit sp.distance.pdist(arr)
100 loops, best of 3: 4.26 ms per loop
%timeit sp.distance.cdist(arr,arr)
100 loops, best of 3: 9.31 ms per loop
%timeit pdist_toarray(arr)
10 loops, best of 3: 66.2 ms per loop
%timeit looping(arr)
1 loops, best of 3: 16.7 s per loop
So if you want the square array back you should use cdist
if you just want the pairs use pdist
. Looping is ~4000x slower for an array with 1000 elements and ~70x slower for an array with 10 elements compared to cdist
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With