
Pruning dendrogram in scipy (hierarchical clustering)

I have a distance matrix with about 5000 entries, and use scipy's hierarchical clustering methods to cluster the matrix. The code I use for this is the following snippet:

import fastcluster
import scipy.cluster.hierarchy as sch
Y = fastcluster.linkage(D, method='centroid')  # D: distance matrix
Z1 = sch.dendrogram(Y, truncate_mode='level', p=7, show_contracted=True)

Since the dendrogram will become rather dense with all this data, I use the truncate_mode to prune it a bit. All of this works, but I wonder how I can find out which of the original 5000 entries belong to a particular branch in the dendrogram.

I tried using

 leaves = sch.leaves_list(Y)

to get a list of leaves, but this takes the linkage output as input, and while I can see the correspondence between the pruned dendrogram and the leaves list, it becomes cumbersome to map the original entries to the dendrogram manually.

To summarize: Is there a way of listing all the original entries in the distance matrix that belong to a branch in a pruned dendrogram? Or are there other methods of doing this that I am not aware of?

Thanks

user1354607 asked Apr 24 '12 19:04





1 Answer

One of the data structures in the dictionary returned by scipy.cluster.hierarchy.dendrogram is stored under the key ivl, which the documentation describes as:

a list of labels corresponding to the leaf nodes

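You can supply custom labels (using labels=<array of labels>) as input to the dendrogram function, but by default they are just the indices of the original observations. By comparing the original labels/indices with Z1['ivl'], you can determine what the original entries were. Note that when truncate_mode is set, a leaf that stands for a contracted cluster of several observations is labeled by default with the number of observations it contains, in parentheses, rather than with an index.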
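For a truncated dendrogram, one possible way to recover the members of each displayed branch is to combine the leaves key of the returned dictionary (the node id of every displayed leaf) with scipy.cluster.hierarchy.to_tree. The following is only a sketch under assumptions: the random toy data, the 'average' method, p=3, and the variable names X, R, and branches are illustrative and not from the question, and scipy's own linkage stands in for fastcluster.linkage, whose output has the same format.

```python
import numpy as np
from scipy.cluster import hierarchy as sch
from scipy.spatial.distance import pdist

# Toy data standing in for the real 5000-entry problem (assumption).
rng = np.random.default_rng(0)
X = rng.random((50, 2))
n = X.shape[0]

# Linkage on a condensed distance matrix; 'average' is an arbitrary choice here.
Y = sch.linkage(pdist(X), method='average')

# no_plot=True computes the dendrogram dictionary without drawing anything.
R = sch.dendrogram(Y, truncate_mode='level', p=3, no_plot=True)

# With rd=True, to_tree also returns a list in which nodelist[i] is the node with id i.
root, nodelist = sch.to_tree(Y, rd=True)

# R['leaves'][k] is the node id of the k-th displayed leaf. Ids below n are
# original observations; ids >= n are contracted clusters, whose members are
# collected by pre_order(). A list of pairs is used because count labels
# such as '(3)' can repeat.
branches = [
    (label, sorted(nodelist[node_id].pre_order()))
    for label, node_id in zip(R['ivl'], R['leaves'])
]

for label, members in branches:
    print(label, members)
```

Each pair holds the label shown under a leaf of the pruned dendrogram and the indices of the original observations beneath it, so the displayed leaves together account for every original entry exactly once.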

Dhara answered Oct 16 '22 20:10
