I have a distance matrix n*n M
where M_ij
is the distance between object_i
and object_j
. So as expected, it takes the following form:
/ 0 M_01 M_02 ... M_0n\ | M_10 0 M_12 ... M_1n | | M_20 M_21 0 ... M2_n | | ... | \ M_n0 M_n2 M_n2 ... 0 /
Now I wish to cluster these n objects with hierarchical clustering. Python has an implementation of this called scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean')
.
Its documentation says:
y must be a {n \choose 2} sized vector where n is the number of original observations paired in the distance matrix.
y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array.
I am confused by this description of y
. Can I directly feed my M
in as the input y
?
Update
@hongbo-zhu-cn has raised this issue up in GitHub. This is exactly what I am concerning about. However, as a newbie to GitHub, I don't know how it works and therefore have no idea how this issue is dealt with.
For most common hierarchical clustering software, the default distance measure is the Euclidean distance. This is the square root of the sum of the square differences. However, for gene expression, correlation distance is often used.
In single-link (or single linkage) hierarchical clustering, we merge in each step the two clusters whose two closest members have the smallest distance (or: the two clusters with the smallest minimum pairwise distance). Complete-link clustering can also be described using the concept of clique.
Performs hierarchical/agglomerative clustering on the condensed distance matrix y. sized vector where n is the number of original observations paired in the distance matrix. The behavior of this function is very similar to the MATLAB linkage function.
Since we are using complete linkage clustering, the distance between "35" and every other item is the maximum of the distance between this item and 3 and this item and 5. For example, d(1,3)= 3 and d(1,5)=11. So, D(1,"35")=11. This gives us the new distance matrix.
It seems that indeed we cannot directly pass the redundant square matrix in, although the documentation claims we can do so.
To benefit anyone who faces the same problem in the future, I write my solution as an additional answer here. So the copy-and-paste guys can just proceed with the clustering.
Use the following snippet to condense the matrix and happily proceed.
import scipy.spatial.distance as ssd # convert the redundant n*n square matrix form into a condensed nC2 array distArray = ssd.squareform(distMatrix) # distArray[{n choose 2}-{n-i choose 2} + (j-i-1)] is the distance between points i and j
Please correct me if I am wrong.
For now you should pass in the 'condensed distance matrix', i.e. just the upper triangle of the distance matrix in vector form:
y = M[np.triu_indices(n,1)]
From the discussion of @hongbo-zhu-cn's pull request it looks as though the solution will be to add an extra keyword argument to the linkage
function that will allow the user to explicitly specify that they are passing in an n x n distance matrix rather than an m x n observation matrix.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With