I am new to Python so this question might look trivia. However, I did not find a similar case to mine. I have a matrix of coordinates for 20 nodes. I want to compute the euclidean distance between all pairs of nodes from this set and store them in a pairwise matrix. For example, If I have 20 nodes, I want the end result to be a matrix of (20,20) with values of euclidean distance between each pairs of nodes. I tried to used a for loop to go through each element of the coordinate set and compute euclidean distance as follows:
ncoord=numpy.matrix('3225 318;2387 989;1228 2335;57 1569;2288 8138;3514 2350;7936 314;9888 4683;6901 1834;7515 8231;709 3701;1321 8881;2290 2350;5687 5034;760 9868;2378 7521;9025 5385;4819 5943;2917 9418;3928 9770')
n=20
c=numpy.zeros((n,n))
for i in range(0,n):
for j in range(i+1,n):
c[i][j]=math.sqrt((ncoord[i][0]-ncoord[j][0])**2+(ncoord[i][1]-ncoord[j][1])**2)
How ever, I am getting an error of "input must be a square array ". I wonder if anybody knows what is happening here. Thanks
How to Calculate Euclidean Distance in Python? The formula to calculate the distance between two points (x1 1 , y1 1 ) and (x2 2 , y2 2 ) is d = √[(x2 – x1)2 + (y2 – y1)2].
In this method, we first initialize two numpy arrays. Then, we take the difference of the two arrays, compute the dot product of the result, and transpose of the result. Then we take the square root of the answer. This is another way to implement Euclidean distance.
dist() method in Python is used to the Euclidean distance between two points p and q, each given as a sequence (or iterable) of coordinates. The two points must have the same dimension. This method is new in Python version 3.8. Returns: the calculated Euclidean distance between the given points.
There are much, much faster alternatives to using nested for
loops for this. I'll show you two different approaches - the first will be a more general method that will introduce you to broadcasting and vectorization, and the second uses a more convenient scipy library function.
One of the first things I'd suggest doing is switching to using np.array
rather than np.matrix
. Arrays are preferred for a number of reasons, most importantly because they can have >2 dimensions, and they make element-wise multiplication much less awkward.
import numpy as np
ncoord = np.array(ncoord)
With an array, we can eliminate the nested for
loops by inserting a new singleton dimension and broadcasting the subtraction over it:
# indexing with None (or np.newaxis) inserts a new dimension of size 1
print(ncoord[:, :, None].shape)
# (20, 2, 1)
# by making the 'inner' dimensions equal to 1, i.e. (20, 2, 1) - (1, 2, 20),
# the subtraction is 'broadcast' over every pair of rows in ncoord
xydiff = ncoord[:, :, None] - ncoord[:, :, None].T
print(xydiff.shape)
# (20, 2, 20)
This is equivalent to looping over every pair of rows using nested for loops, but much, much faster!
xydiff2 = np.zeros((20, 2, 20), dtype=xydiff.dtype)
for ii in range(20):
for jj in range(20):
for kk in range(2):
xydiff[ii, kk, jj] = ncoords[ii, kk] - ncoords[jj, kk]
# check that these give the same result
print(np.all(xydiff == xydiff2))
# True
The rest we can also do using vectorized operations:
# we square the differences and sum over the 'middle' axis, equivalent to
# computing (x_i - x_j) ** 2 + (y_i - y_j) ** 2
ssdiff = (xydiff * xydiff).sum(1)
# finally we take the square root
D = np.sqrt(ssdiff)
The whole thing could be done in one line like this:
D = np.sqrt(((ncoord[:, :, None] - ncoord[:, :, None].T) ** 2).sum(1))
pdist
It turns out that there's already a fast and convenient function for computing all pairwise distances: scipy.spatial.distance.pdist
.
from scipy.spatial.distance import pdist, squareform
d = pdist(ncoord)
# pdist just returns the upper triangle of the pairwise distance matrix. to get
# the whole (20, 20) array we can use squareform:
print(d.shape)
# (190,)
D2 = squareform(d)
print(D2.shape)
# (20, 20)
# check that the two methods are equivalent
print np.all(D == D2)
# True
for i in range(0, n):
for j in range(i+1, n):
c[i, j] = math.sqrt((ncoord[i, 0] - ncoord[j, 0])**2
+ (ncoord[i, 1] - ncoord[j, 1])**2)
Note: ncoord[i, j]
is not the same as ncoord[i][j]
for a Numpy matrix. This appears to be the source of confusion. If ncoord
is a Numpy array then they will give the same result.
For a Numpy matrix, ncoord[i]
returns the ith row of ncoord
, which itself is a Numpy matrix object with shape 1 x 2 in your case. Therefore, ncoord[i][j]
actually means: take the ith row of ncoord
and take the jth row of that 1 x 2 matrix. This is where your indexing problems comes about when j
> 0.
Regarding your comments on assigning to c[i][j]
"working", it shouldn't. At least on my build of Numpy 1.9.1 it shouldn't work if your indices i
and j
iterates up to n
.
As an aside, remember to add the transpose of the matrix c
to itself.
It is recommended to use Numpy arrays instead of matrix. See this post.
If your coordinates are stored as a Numpy array, then pairwise distance can be computed as:
from scipy.spatial.distance import pdist
pairwise_distances = pdist(ncoord, metric="euclidean", p=2)
or simply
pairwise_distances = pdist(ncoord)
since the default metric is "euclidean", and default "p" is 2.
In a comment below I mistakenly mentioned that the result of pdist is a n x n matrix. To get a n x n matrix, you will need to do the following:
from scipy.spatial.distance import pdist, squareform
pairwise_distances = squareform(pdist(ncoord))
or
from scipy.spatial.distance import cdist
pairwise_distances = cdist(ncoord, ncoord)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With