Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does condensed distance matrix work? (pdist)

scipy.spatial.distance.pdist returns a condensed distance matrix. From the documentation:

Returns a condensed distance matrix Y. For each and (where ), the metric dist(u=X[i], v=X[j]) is computed and stored in entry ij.

I thought ij meant i*j. But I think I might be wrong. Consider

X = array([[1,2], [1,2], [3,4]]) dist_matrix = pdist(X) 

then the documentation says that dist(X[0], X[2]) should be dist_matrix[0*2]. However, dist_matrix[0*2] is 0 -- not 2.8 as it should be.

What's the formula I should use to access the similarity of a two vectors, given i and j?

like image 828
Rafael Almeida Avatar asked Oct 26 '12 01:10

Rafael Almeida


People also ask

How does Scipy Pdist work?

scipy. stats. pdist(array, axis=0) function calculates the Pairwise distances between observations in n-dimensional space. Returns : Pairwise distances of the array elements based on the set parameters.

How does a distance matrix work?

A distance matrix is a table that shows the distance between pairs of objects. For example, in the table below we can see a distance of 16 between A and B, of 47 between A and C, and so on. By definition, an object's distance from itself, which is shown in the main diagonal of the table, is 0.

What does Scipy spatial distance Pdist do?

Computes the normalized Hamming distance, or the proportion of those vector elements between two n-vectors u and v which disagree.

What is an Uncondensed distance matrix?

Since a distance matrix is usually computed between a set of points and itself (ie all pairs of elements in the set of points), it is symmetric and the diagonal is all zero. Squareform/condensed saves memory by only representing the upper triangular points, whereas uncondensed is the full matrix.


2 Answers

You can look at it this way: Suppose x is m by n. The possible pairs of m rows, chosen two at a time, is itertools.combinations(range(m), 2), e.g, for m=3:

>>> import itertools >>> list(combinations(range(3),2)) [(0, 1), (0, 2), (1, 2)] 

So if d = pdist(x), the kth tuple in combinations(range(m), 2)) gives the indices of the rows of x associated with d[k].

Example:

>>> x = array([[0,10],[10,10],[20,20]]) >>> pdist(x) array([ 10.        ,  22.36067977,  14.14213562]) 

The first element is dist(x[0], x[1]), the second is dist(x[0], x[2]) and the third is dist(x[1], x[2]).

Or you can view it as the elements in the upper triangular part of the square distance matrix, strung together into a 1D array.

E.g.

>>> squareform(pdist(x))  array([[  0.   ,  10.   ,  22.361],        [ 10.   ,   0.   ,  14.142],        [ 22.361,  14.142,   0.   ]])  >>> y = array([[0,10],[10,10],[20,20],[10,0]]) >>> squareform(pdist(y))  array([[  0.   ,  10.   ,  22.361,  14.142],        [ 10.   ,   0.   ,  14.142,  10.   ],        [ 22.361,  14.142,   0.   ,  22.361],        [ 14.142,  10.   ,  22.361,   0.   ]]) >>> pdist(y) array([ 10.   ,  22.361,  14.142,  14.142,  10.   ,  22.361]) 
like image 181
Warren Weckesser Avatar answered Sep 19 '22 09:09

Warren Weckesser


Condensed distance matrix to full distance matrix

A condensed distance matrix as returned by pdist can be converted to a full distance matrix by using scipy.spatial.distance.squareform:

>>> import numpy as np >>> from scipy.spatial.distance import pdist, squareform >>> points = np.array([[0,1],[1,1],[3,5], [15, 5]]) >>> dist_condensed = pdist(points) >>> dist_condensed array([  1.        ,   5.        ,  15.5241747 ,   4.47213595,         14.56021978,  12.        ]) 

Use squareform to convert to full matrix:

>>> dist = squareform(dist_condensed) array([[  0.        ,   1.        ,   5.        ,  15.5241747 ],        [  1.        ,   0.        ,   4.47213595,  14.56021978],        [  5.        ,   4.47213595,   0.        ,  12.        ],        [ 15.5241747 ,  14.56021978,  12.        ,   0.        ]]) 

Distance between point i,j is stored in dist[i, j]:

>>> dist[2, 0] 5.0 >>> np.linalg.norm(points[2] - points[0]) 5.0 

Indices to condensed index

One can convert indices used for accessing the elements of the square matrix to the index in the condensed matrix:

def square_to_condensed(i, j, n):     assert i != j, "no diagonal elements in condensed matrix"     if i < j:         i, j = j, i     return n*j - j*(j+1)//2 + i - 1 - j 

Example:

>>> square_to_condensed(1, 2, len(points)) 3 >>> dist_condensed[3] 4.4721359549995796 >>> dist[1,2] 4.4721359549995796 

Condensed index to indices

Also the other direction is possible without sqaureform which is better in terms of runtime and memory consumption:

import math  def calc_row_idx(k, n):     return int(math.ceil((1/2.) * (- (-8*k + 4 *n**2 -4*n - 7)**0.5 + 2*n -1) - 1))  def elem_in_i_rows(i, n):     return i * (n - 1 - i) + (i*(i + 1))//2  def calc_col_idx(k, i, n):     return int(n - elem_in_i_rows(i + 1, n) + k)  def condensed_to_square(k, n):     i = calc_row_idx(k, n)     j = calc_col_idx(k, i, n)     return i, j 

Example:

>>> condensed_to_square(3, 4) (1.0, 2.0) 

Runtime comparison with squareform

>>> import numpy as np >>> from scipy.spatial.distance import pdist, squareform >>> points = np.random.random((10**4,3)) >>> %timeit dist_condensed = pdist(points) 1 loops, best of 3: 555 ms per loop 

Creating the sqaureform turns out to be really slow:

>>> dist_condensed = pdist(points) >>> %timeit dist = squareform(dist_condensed) 1 loops, best of 3: 2.25 s per loop 

If we are searching two points with maximum distance it is not surprising that searching in full matrix is O(n) while in condensed form only O(n/2):

>>> dist = squareform(dist_condensed) >>> %timeit dist_condensed.argmax() 10 loops, best of 3: 45.2 ms per loop >>> %timeit dist.argmax() 10 loops, best of 3: 93.3 ms per loop 

Getting the inideces for the two points takes almost no time in both cases but of course there is some overhead for calculating the condensed index:

>>> idx_flat = dist.argmax() >>> idx_condensed = dist.argmax() >>> %timeit list(np.unravel_index(idx_flat, dist.shape)) 100000 loops, best of 3: 2.28 µs per loop >>> %timeit condensed_to_square(idx_condensed, len(points)) 100000 loops, best of 3: 14.7 µs per loop 
like image 34
lumbric Avatar answered Sep 23 '22 09:09

lumbric