<code>scipy.spatial.distance.pdist</code> returns a condensed distance matrix. From the documentation: <blockquote> Returns a condensed distance matrix Y. For each and (where ), the metric dist(u=X[i], v=X[j]) is computed and stored in entry ij. </blockquote> I thought <code>ij</code> meant <code>i*j</code>. But I think I might be wrong. Consider <pre class="prettyprint lang-py prettyprint-override"><code>X = array([[1,2], [1,2], [3,4]]) dist_matrix = pdist(X) </code></pre> then the documentation says that <code>dist(X[0], X[2])</code> should be <code>dist_matrix[0*2]</code>. However, <code>dist_matrix[0*2]</code> is 0 -- not 2.8 as it should be. What's the formula I should use to access the similarity of a two vectors, given <code>i</code> and <code>j</code>?

You can look at it this way: Suppose <code>x</code> is m by n. The possible pairs of <code>m</code> rows, chosen two at a time, is <code>itertools.combinations(range(m), 2)</code>, e.g, for <code>m=3</code>: <pre class="prettyprint"><code>>>> import itertools >>> list(combinations(range(3),2)) [(0, 1), (0, 2), (1, 2)] </code></pre> So if <code>d = pdist(x)</code>, the <code>k</code>th tuple in <code>combinations(range(m), 2))</code> gives the indices of the rows of <code>x</code> associated with <code>d[k]</code>. Example: <pre class="prettyprint"><code>>>> x = array([[0,10],[10,10],[20,20]]) >>> pdist(x) array([ 10. , 22.36067977, 14.14213562]) </code></pre> The first element is <code>dist(x[0], x[1])</code>, the second is <code>dist(x[0], x[2])</code> and the third is <code>dist(x[1], x[2])</code>. Or you can view it as the elements in the upper triangular part of the square distance matrix, strung together into a 1D array. E.g. <pre class="prettyprint"><code>>>> squareform(pdist(x)) array([[ 0. , 10. , 22.361], [ 10. , 0. , 14.142], [ 22.361, 14.142, 0. ]]) >>> y = array([[0,10],[10,10],[20,20],[10,0]]) >>> squareform(pdist(y)) array([[ 0. , 10. , 22.361, 14.142], [ 10. , 0. , 14.142, 10. ], [ 22.361, 14.142, 0. , 22.361], [ 14.142, 10. , 22.361, 0. ]]) >>> pdist(y) array([ 10. , 22.361, 14.142, 14.142, 10. , 22.361]) </code></pre>

How does condensed distance matrix work? (pdist)

Tags:

python

numpy

scipy

scipy.spatial.distance.pdist returns a condensed distance matrix. From the documentation:

Returns a condensed distance matrix Y. For each and (where ), the metric dist(u=X[i], v=X[j]) is computed and stored in entry ij.

I thought ij meant i*j. But I think I might be wrong. Consider

X = array([[1,2], [1,2], [3,4]]) dist_matrix = pdist(X)

then the documentation says that dist(X[0], X[2]) should be dist_matrix[0*2]. However, dist_matrix[0*2] is 0 -- not 2.8 as it should be.

What's the formula I should use to access the similarity of a two vectors, given i and j?

828

asked Oct 26 '12 01:10

Rafael Almeida

2 Answers

You can look at it this way: Suppose x is m by n. The possible pairs of m rows, chosen two at a time, is itertools.combinations(range(m), 2), e.g, for m=3:

>>> import itertools >>> list(combinations(range(3),2)) [(0, 1), (0, 2), (1, 2)]

So if d = pdist(x), the kth tuple in combinations(range(m), 2)) gives the indices of the rows of x associated with d[k].

Example:

>>> x = array([[0,10],[10,10],[20,20]]) >>> pdist(x) array([ 10.        ,  22.36067977,  14.14213562])

The first element is dist(x[0], x[1]), the second is dist(x[0], x[2]) and the third is dist(x[1], x[2]).

Or you can view it as the elements in the upper triangular part of the square distance matrix, strung together into a 1D array.

E.g.

>>> squareform(pdist(x))  array([[  0.   ,  10.   ,  22.361],        [ 10.   ,   0.   ,  14.142],        [ 22.361,  14.142,   0.   ]])  >>> y = array([[0,10],[10,10],[20,20],[10,0]]) >>> squareform(pdist(y))  array([[  0.   ,  10.   ,  22.361,  14.142],        [ 10.   ,   0.   ,  14.142,  10.   ],        [ 22.361,  14.142,   0.   ,  22.361],        [ 14.142,  10.   ,  22.361,   0.   ]]) >>> pdist(y) array([ 10.   ,  22.361,  14.142,  14.142,  10.   ,  22.361])

181

answered Sep 19 '22 09:09

Warren Weckesser

Condensed distance matrix to full distance matrix

A condensed distance matrix as returned by pdist can be converted to a full distance matrix by using scipy.spatial.distance.squareform:

>>> import numpy as np >>> from scipy.spatial.distance import pdist, squareform >>> points = np.array([[0,1],[1,1],[3,5], [15, 5]]) >>> dist_condensed = pdist(points) >>> dist_condensed array([  1.        ,   5.        ,  15.5241747 ,   4.47213595,         14.56021978,  12.        ])

Use squareform to convert to full matrix:

>>> dist = squareform(dist_condensed) array([[  0.        ,   1.        ,   5.        ,  15.5241747 ],        [  1.        ,   0.        ,   4.47213595,  14.56021978],        [  5.        ,   4.47213595,   0.        ,  12.        ],        [ 15.5241747 ,  14.56021978,  12.        ,   0.        ]])

Distance between point i,j is stored in dist[i, j]:

>>> dist[2, 0] 5.0 >>> np.linalg.norm(points[2] - points[0]) 5.0

Indices to condensed index

One can convert indices used for accessing the elements of the square matrix to the index in the condensed matrix:

def square_to_condensed(i, j, n):     assert i != j, "no diagonal elements in condensed matrix"     if i < j:         i, j = j, i     return n*j - j*(j+1)//2 + i - 1 - j

Example:

>>> square_to_condensed(1, 2, len(points)) 3 >>> dist_condensed[3] 4.4721359549995796 >>> dist[1,2] 4.4721359549995796

Condensed index to indices

Also the other direction is possible without sqaureform which is better in terms of runtime and memory consumption:

import math  def calc_row_idx(k, n):     return int(math.ceil((1/2.) * (- (-8*k + 4 *n**2 -4*n - 7)**0.5 + 2*n -1) - 1))  def elem_in_i_rows(i, n):     return i * (n - 1 - i) + (i*(i + 1))//2  def calc_col_idx(k, i, n):     return int(n - elem_in_i_rows(i + 1, n) + k)  def condensed_to_square(k, n):     i = calc_row_idx(k, n)     j = calc_col_idx(k, i, n)     return i, j

Example:

>>> condensed_to_square(3, 4) (1.0, 2.0)

Runtime comparison with squareform

>>> import numpy as np >>> from scipy.spatial.distance import pdist, squareform >>> points = np.random.random((10**4,3)) >>> %timeit dist_condensed = pdist(points) 1 loops, best of 3: 555 ms per loop

Creating the sqaureform turns out to be really slow:

>>> dist_condensed = pdist(points) >>> %timeit dist = squareform(dist_condensed) 1 loops, best of 3: 2.25 s per loop

If we are searching two points with maximum distance it is not surprising that searching in full matrix is O(n) while in condensed form only O(n/2):

>>> dist = squareform(dist_condensed) >>> %timeit dist_condensed.argmax() 10 loops, best of 3: 45.2 ms per loop >>> %timeit dist.argmax() 10 loops, best of 3: 93.3 ms per loop

Getting the inideces for the two points takes almost no time in both cases but of course there is some overhead for calculating the condensed index:

>>> idx_flat = dist.argmax() >>> idx_condensed = dist.argmax() >>> %timeit list(np.unravel_index(idx_flat, dist.shape)) 100000 loops, best of 3: 2.28 µs per loop >>> %timeit condensed_to_square(idx_condensed, len(points)) 100000 loops, best of 3: 14.7 µs per loop

answered Sep 23 '22 09:09

lumbric

Related questions
                            
                                python pip - install from local dir
                            
                                Remove Last Path Component In a String
                            
                                List running Jupyter notebooks and tokens
                            
                                What is the best way for checking if the user of a script has root-like privileges?
                            
                                Split a string with unknown number of spaces as separator in Python
                            
                                Catching DoesNotExist exception in a custom manager in Django
                            
                                Two's Complement in Python
                            
                                How do I copy a directory to a remote machine using Fabric?
                            
                                Python daemon and systemd service
                            
                                Can i cycle through line styles in matplotlib
                            
                                Get class labels from Keras functional model
                            
                                group by week in pandas
                            
                                Does R have function startswith or endswith like python? [closed]
                            
                                create destination path for shutil.copy files
                            
                                Getting min and max Dates from a pandas dataframe
                            
                                How to replace only part of the match with python re.sub
                            
                                Append multiple pandas data frames at once
                            
                                Unable to perform collectstatic
                            
                                how to sort by length of string followed by alphabetical order?
                            
                                How could I use batch normalization in TensorFlow?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How does condensed distance matrix work? (pdist)

Tags:

python

numpy

scipy

Rafael Almeida

People also ask

2 Answers

Warren Weckesser

Condensed distance matrix to full distance matrix

Indices to condensed index

Condensed index to indices

Runtime comparison with squareform

lumbric

Recent Activity

Donate For Us