Numpy efficient one against all

Tags:

python

numpy

I have a 3xN array of 3D coordinates and I would like to efficiently calculate the distance matrix between all entries. Is there a more efficient strategy than the nested loop?

Current pseudocode implementation:

for i, coord in enumerate(coords):
    for j, coord2 in enumerate(coords):
        if i != j:
            # Euclidean distance between points i and j
            dist[i, j] = numpy.linalg.norm(coord - coord2)


asked Aug 30 '13 16:08

El Dude

1 Answer

To reproduce your results exactly:

>>> import scipy.spatial as sp
>>> import numpy as np
>>> a=np.random.rand(5,3) #Note this is the transpose of your array.
>>> a
array([[ 0.83921304,  0.72659404,  0.50434178],  #0
       [ 0.99883826,  0.91739731,  0.9435401 ],  #1
       [ 0.94327962,  0.57665875,  0.85853404],  #2
       [ 0.30053567,  0.44458829,  0.35677649],  #3
       [ 0.01345765,  0.49247883,  0.11496977]]) #4
>>> sp.distance.cdist(a,a)
array([[ 0.        ,  0.50475862,  0.39845025,  0.62568048,  0.94249268],
       [ 0.50475862,  0.        ,  0.35554966,  1.02735895,  1.35575051],
       [ 0.39845025,  0.35554966,  0.        ,  0.82602847,  1.1935422 ],
       [ 0.62568048,  1.02735895,  0.82602847,  0.        ,  0.3783884 ],
       [ 0.94249268,  1.35575051,  1.1935422 ,  0.3783884 ,  0.        ]])
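
The question stores the points as a 3xN array, while cdist expects one point per row, so the array needs to be transposed first. A minimal sketch, assuming a hypothetical coords array of shape (3, N) filled with random values for illustration:

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical input laid out as in the question:
# 3 rows (x, y, z) and N columns, one column per point.
rng = np.random.default_rng(0)
coords = rng.random((3, 5))

# cdist expects one observation per row, so pass the (N, 3) transpose.
points = coords.T
dist = distance.cdist(points, points)

print(dist.shape)  # (5, 5)
```

The transpose is a view, so no coordinate data is copied before the call.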

To do it more efficiently, avoiding duplicate calculations and computing only the unique pairs:

>>> sp.distance.pdist(a)
array([ 0.50475862,  0.39845025,  0.62568048,  0.94249268,  0.35554966,
        1.02735895,  1.35575051,  0.82602847,  1.1935422 ,  0.3783884 ])
#pairs: [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3),
#         (2, 4), (3, 4)]
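
The pair list in the comment above does not have to be written by hand. As a sketch, the indices of the strict upper triangle from np.triu_indices come out in the same row-major order as the pdist output, so the two can be zipped together (the random input here is only for illustration):

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
a = rng.random((5, 3))

condensed = distance.pdist(a)

# Row and column indices of the strict upper triangle (k=1 skips the
# diagonal), in row-major order: (0, 1), (0, 2), ..., (3, 4).
rows, cols = np.triu_indices(a.shape[0], k=1)
pairs = list(zip(rows.tolist(), cols.tolist()))

# Print each pair next to its distance.
for (i, j), d in zip(pairs, condensed):
    print(i, j, d)
```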

Note the relationship between the two arrays. The cdist array can be reproduced by:

>>> out=np.zeros((a.shape[0],a.shape[0]))
>>> dists=sp.distance.pdist(a)
>>> out[np.triu_indices(a.shape[0],1)]=dists
>>> out+=out.T

>>> out
array([[ 0.        ,  0.50475862,  0.39845025,  0.62568048,  0.94249268],
       [ 0.50475862,  0.        ,  0.35554966,  1.02735895,  1.35575051],
       [ 0.39845025,  0.35554966,  0.        ,  0.82602847,  1.1935422 ],
       [ 0.62568048,  1.02735895,  0.82602847,  0.        ,  0.3783884 ],
       [ 0.94249268,  1.35575051,  1.1935422 ,  0.3783884 ,  0.        ]])
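
The same conversion between the condensed pdist vector and the square matrix is also available as scipy.spatial.distance.squareform. A minimal sketch, with random input generated here purely for illustration:

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
a = rng.random((5, 3))

# Condensed vector of the N*(N-1)/2 unique pairwise distances.
condensed = distance.pdist(a)

# Expand to the symmetric N x N matrix with a zero diagonal.
square = distance.squareform(condensed)

# squareform also works in the other direction, square to condensed.
back = distance.squareform(square)

print(square.shape, back.shape)  # (5, 5) (10,)
```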

Some somewhat surprising timings.

The setup:

def pdist_toarray(a):
    out=np.zeros((a.shape[0],a.shape[0]))
    dists=sp.distance.pdist(a)

    out[np.triu_indices(a.shape[0],1)]=dists
    return out+out.T

def looping(a):
    out=np.zeros((a.shape[0],a.shape[0]))
    for i in range(a.shape[0]):
        for j in range(a.shape[0]):
            out[i,j]=np.linalg.norm(a[i]-a[j])
    return out

Timings:

arr=np.random.rand(1000,3)

%timeit sp.distance.pdist(arr)
100 loops, best of 3: 4.26 ms per loop

%timeit sp.distance.cdist(arr,arr)
100 loops, best of 3: 9.31 ms per loop

%timeit pdist_toarray(arr)
10 loops, best of 3: 66.2 ms per loop

%timeit looping(arr)
1 loops, best of 3: 16.7 s per loop

So if you want the square array back, use cdist; if you just want the pairs, use pdist. With 1000 points, the timings above put looping at roughly 1800x slower than cdist (and roughly 4000x slower than pdist); with 10 points it is still ~70x slower.
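
If SciPy is not available, the nested loop can also be replaced with NumPy broadcasting. This is a sketch of an alternative technique, not something that was timed above; the function names pairwise_broadcast and one_against_all are made up for illustration:

```python
import numpy as np

def pairwise_broadcast(a):
    """All-pairs Euclidean distances for an (N, 3) array of points."""
    # (N, 1, 3) - (1, N, 3) broadcasts to an (N, N, 3) array of differences.
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def one_against_all(a, i):
    """Distances from point i to every point in a (including itself)."""
    return np.linalg.norm(a - a[i], axis=1)

rng = np.random.default_rng(0)
a = rng.random((5, 3))

D = pairwise_broadcast(a)
row = one_against_all(a, 2)

print(D.shape, row.shape)  # (5, 5) (5,)
```

Note that pairwise_broadcast builds a temporary (N, N, 3) array, so its memory use grows quadratically with the number of points.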


answered Oct 12 '22 08:10

Daniel
