I have a distance matrix with about 5000 entries, and use scipy's hierarchical clustering methods to cluster the matrix. The code I use for this is the following snippet:
import fastcluster
import scipy.cluster.hierarchy as sch
Y = fastcluster.linkage(D, method='centroid')  # D: distance matrix
Z1 = sch.dendrogram(Y, truncate_mode='level', p=7, show_contracted=True)
Since the dendrogram will become rather dense with all this data, I use the truncate_mode to prune it a bit. All of this works, but I wonder how I can find out which of the original 5000 entries belong to a particular branch in the dendrogram.
I tried using
leaves = sch.leaves_list(Y)
to get a list of leaves, but this takes the linkage output as input, and while I can see the correspondence between the pruned dendrogram and the leaves list, mapping the original entries to the dendrogram by hand is cumbersome.
To summarize: is there a way of listing all the original entries in the distance matrix that belong to a given branch of a pruned dendrogram? Or are there other methods of doing this that I am not aware of?
Thanks
Huge dendrograms can be pruned in the Pruning box by selecting the maximum depth of the dendrogram. This only affects the display, not the actual clustering. The widget offers three different selection methods, including Manual (clicking inside the dendrogram selects a cluster).
The common practice to flatten dendrograms in k clusters is to cut them off at constant height k−1.
Observations are allocated to clusters by drawing a horizontal line through the dendrogram. Observations that are joined together below the line are in clusters.
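As a rough illustration of cutting at a horizontal line, here is a minimal sketch using scipy's fcluster with criterion='distance'. The toy data, the 'average' linkage method and the cut height of 1.0 are all assumptions chosen for the example, not values from the question.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Toy data (assumption): two tight, well-separated blobs of 10 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(10, 2)),
    rng.normal(5.0, 0.1, size=(10, 2)),
])

# pdist returns the condensed distance matrix that linkage expects.
Y = sch.linkage(pdist(X), method='average')

# "Draw the horizontal line" at an illustrative height of 1.0:
# observations joined below that height share a flat cluster label.
cut_height = 1.0
labels = sch.fcluster(Y, t=cut_height, criterion='distance')

# Group the original observation indices by flat cluster label.
clusters = {int(c): np.flatnonzero(labels == c).tolist()
            for c in np.unique(labels)}
for c, members in clusters.items():
    print(c, members)
```

If you would rather ask for a fixed number of clusters than pick a height, fcluster also accepts criterion='maxclust'.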
A dendrogram is a diagrammatic representation of the hierarchical relationship between data points. It illustrates the arrangement of the clusters produced by the corresponding analysis and is used to inspect the output of hierarchical (agglomerative) clustering.
The dictionary returned by scipy.cluster.hierarchy.dendrogram has the key ivl, which the documentation describes as:
a list of labels corresponding to the leaf nodes
You can supply custom labels (using labels=<array of labels>) as input to the dendrogram function, but by default they are just the indices of the original observations. By comparing the original labels/indices with Z1['ivl'], you can determine what the original entries were. Note that in a truncated dendrogram, a leaf that stands for a contracted branch is labelled by default with its member count in parentheses rather than with an index.
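For a truncated dendrogram, one way to go from a displayed leaf back to the original entries is to combine the 'leaves' key of the returned dictionary (the linkage node id shown at each position) with sch.to_tree, whose ClusterNode.pre_order() lists the original observations under a node. The sketch below is only an illustration: branch_members is a hypothetical helper name, the data is random, and it uses scipy's own linkage with method='average' instead of fastcluster so that it runs without extra packages.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

def branch_members(Y, **dendrogram_kwargs):
    """Hypothetical helper: for each leaf shown in a (possibly truncated)
    dendrogram, return (label, original observation indices under it)."""
    # no_plot=True computes the layout without requiring matplotlib.
    R = sch.dendrogram(Y, no_plot=True, **dendrogram_kwargs)
    # With rd=True, to_tree also returns a list where nodes[i].id == i.
    _, nodes = sch.to_tree(Y, rd=True)
    # R['leaves'][i] is the linkage node id drawn at position i;
    # ids >= n are contracted (non-singleton) branches.
    return [(label, sorted(nodes[node_id].pre_order()))
            for label, node_id in zip(R['ivl'], R['leaves'])]

# Toy data (assumption): 40 random 2-D points.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
Y = sch.linkage(pdist(X), method='average')

members = branch_members(Y, truncate_mode='level', p=3)
for label, idx in members:
    print(label, idx)
```

With the linkage from the question, the call would look like branch_members(Y, truncate_mode='level', p=7), and each entry then lists which of the ~5000 original rows sit under that pruned branch.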