Triangle vs. Square distance matrix for Hierarchical Clustering Python? [duplicate]

I have been experimenting with Hierarchical Clustering and in R it's so simple hclust(as.dist(X),method="average") . I found a method in Python that is pretty simple as well, except I'm a little confused on what's going on with my input distance matrix.

I have a similarity matrix (DF_c93tom w/ a smaller test version called DF_sim) that I convert into a dissimilarity matrix DF_dissm = 1 - DF_sim.

I use this as input to linkage from scipy, but the documentation says it takes a square or triangle matrix. I get a different clustering depending on whether I input a lower triangle, an upper triangle, or the square matrix. Why is this? According to the documentation it wants an upper triangle, but the lower triangle clustering looks REALLY similar.

My question: why are all the clusterings different, and which one is correct?

This is the documentation for the input distance matrix for linkage:

y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. 

Here is my code:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

%matplotlib inline

#Test Data
DF_sim = DF_c93tom.iloc[:10,:10] #Similarity Matrix
DF_sim.columns = DF_sim.index = range(10) 
#print(DF_sim)
#           0  1         2         3  4  5  6  7  8  9
# 0  1.000000  0  0.395833  0.083333  0  0  0  0  0  0
# 1  0.000000  1  0.000000  0.000000  0  0  0  0  0  0
# 2  0.395833  0  1.000000  0.883792  0  0  0  0  0  0
# 3  0.083333  0  0.883792  1.000000  0  0  0  0  0  0
# 4  0.000000  0  0.000000  0.000000  1  0  0  0  0  0
# 5  0.000000  0  0.000000  0.000000  0  1  0  0  0  0
# 6  0.000000  0  0.000000  0.000000  0  0  1  0  0  0
# 7  0.000000  0  0.000000  0.000000  0  0  0  1  0  0
# 8  0.000000  0  0.000000  0.000000  0  0  0  0  1  0
# 9  0.000000  0  0.000000  0.000000  0  0  0  0  0  1

#Dissimilarity Matrix
DF_dissm = 1 - DF_sim

#Redundant Matrix
#np.tril(DF_dissm).T == np.triu(DF_dissm)
#True for all values

#Hierarchical Clustering for square and triangle matrices
fig_1 = plt.figure(1)
plt.title("Square")
Z_square = linkage((DF_dissm.values),method="average")
dendrogram(Z_square)

fig_2 = plt.figure(2)
plt.title("Triangle Upper")
Z_triu = linkage(np.triu(DF_dissm.values),method="average")
dendrogram(Z_triu)

fig_3 = plt.figure(3)
plt.title("Triangle Lower")
Z_tril = linkage(np.tril(DF_dissm.values),method="average")
dendrogram(Z_tril)

plt.show()


asked Feb 08 '23 07:02 by O.rka


1 Answer

When a 2D array is passed as the first argument to scipy.cluster.hierarchy.linkage, it is treated as a sequence of observations (one per row), and scipy.spatial.distance.pdist is used to convert it to a sequence of pairwise distances between those observations.

There is a GitHub issue regarding this behavior, since it means that passing a square "distance matrix" such as DF_dissm.values (silently) produces an incorrect result: the rows of the matrix are clustered as if they were coordinates.
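To make this concrete, here is a minimal sketch (the random points, the seed, and the variable names are illustrative assumptions, not part of the question). It checks that passing a square matrix gives the same result as running pdist on its rows, and that this differs from clustering on the distances themselves. Recent SciPy versions may emit a warning on the first call because the input looks like an uncondensed distance matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Illustrative data: 6 random 2D points and their square distance matrix.
rng = np.random.default_rng(0)
points = rng.random((6, 2))
D = squareform(pdist(points))  # 6x6, symmetric, zero diagonal

# Passing the square matrix: each row of D is treated as an observation.
Z_square = linkage(D, method="average")

# The explicit equivalent: Euclidean distances between the rows of D.
Z_rows = linkage(pdist(D), method="average")

# What was actually intended: D's entries used as the distances.
Z_condensed = linkage(squareform(D), method="average")

same_as_rows = np.allclose(Z_square, Z_rows)
same_as_condensed = np.allclose(Z_square, Z_condensed)
print("square == pdist(rows):    ", same_as_rows)
print("square == condensed input:", same_as_condensed)
```

The same reasoning explains why np.triu and np.tril give yet other results: each is just a different 2D array whose rows get treated as points.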

So the upshot of this is that none of these

Z_square = linkage((DF_dissm.values),method="average")
Z_triu = linkage(np.triu(DF_dissm.values),method="average")
Z_tril = linkage(np.tril(DF_dissm.values),method="average")

produce the desired result. Instead use

  • np.triu_indices:

    h, w = arr.shape
    Z = linkage(arr[np.triu_indices(h, 1)], method="average")
    
  • or spatial.distance.squareform:

    from scipy.spatial import distance as ssd
    Z = linkage(ssd.squareform(arr), method="average")
    
  • or apply spatial.distance.pdist to the original points (the raw observations, not a distance matrix):

    Z = linkage(ssd.pdist(points), method="average")
    
  • or pass the 2D array of points directly, in which case linkage applies pdist internally:

    Z = linkage(points, method="average")
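Applied to the dissimilarity matrix from the question, the squareform route might look like the sketch below. The similarity values are copied from the printed DF_sim; rebuilding them as a plain NumPy array (and the names sim, dissm, condensed) is an assumption made so the example runs on its own. Note that squareform checks by default that the matrix is symmetric with a zero diagonal, which holds here because the similarity diagonal is exactly 1.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Similarity matrix rebuilt from the values printed in the question.
sim = np.eye(10)
sim[0, 2] = sim[2, 0] = 0.395833
sim[0, 3] = sim[3, 0] = 0.083333
sim[2, 3] = sim[3, 2] = 0.883792

# Dissimilarity: square, symmetric, zeros on the diagonal.
dissm = 1 - sim

# Convert the square matrix to the condensed (flat upper-triangle) form.
condensed = squareform(dissm)  # length 10 * 9 / 2 = 45

Z = linkage(condensed, method="average")

# The first merge joins the most similar pair, items 2 and 3,
# at height 1 - 0.883792.
print(Z[0])
```

With a pandas DataFrame, the equivalent call would be squareform(DF_dissm.values).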
    

import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial import distance as ssd
np.random.seed(2016)

points = np.random.random((10, 2))
arr = ssd.cdist(points, points)

fig, ax = plt.subplots(nrows=4)

ax[0].set_title("condensed upper triangular")
Z = hier.linkage(arr[np.triu_indices(arr.shape[0], 1)], method="average")
hier.dendrogram(Z, ax=ax[0])

ax[1].set_title("squareform")
Z = hier.linkage(ssd.squareform(arr), method="average")
hier.dendrogram(Z, ax=ax[1])

ax[2].set_title("pdist")
Z = hier.linkage(ssd.pdist(points), method="average")
hier.dendrogram(Z, ax=ax[2])

ax[3].set_title("sequence of observations")
Z = hier.linkage(points, method="average")
hier.dendrogram(Z, ax=ax[3])

plt.show()
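Instead of comparing the four dendrograms by eye, the linkage matrices can be compared numerically. This is a small sketch that reuses the same seed and data as the code above; the names Z_triu, Z_squareform, Z_pdist, and Z_points are introduced here only for illustration.

```python
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial import distance as ssd

np.random.seed(2016)
points = np.random.random((10, 2))
arr = ssd.cdist(points, points)  # square distance matrix

# The four input forms from the list above.
Z_triu = hier.linkage(arr[np.triu_indices(arr.shape[0], 1)], method="average")
Z_squareform = hier.linkage(ssd.squareform(arr), method="average")
Z_pdist = hier.linkage(ssd.pdist(points), method="average")
Z_points = hier.linkage(points, method="average")

# All four should describe the same hierarchy.
all_equal = all(
    np.allclose(Z_triu, Z_other)
    for Z_other in (Z_squareform, Z_pdist, Z_points)
)
print("all four linkages agree:", all_equal)
```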


answered Feb 12 '23 12:02 by unutbu