
How do I get the subtrees of a dendrogram made by scipy.cluster.hierarchy?


I was confused by this module (scipy.cluster.hierarchy), and to some extent I still am.

For example we have the following dendrogram:

[image: hierarchical clustering dendrogram]

My question is: how can I extract the coloured subtrees (each one represents a cluster) in a convenient format, say SIF? The code that produces the plot above is:

    import numpy as np
    import scipy.cluster.hierarchy as sch
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import pdist

    X = np.random.randn(100, 2)    # 100 random 2-D observations
    d = pdist(X)                   # condensed pairwise distance matrix
    Z = sch.linkage(d, method='complete')
    P = sch.dendrogram(Z)
    plt.savefig('plot_dendrogram.png')

    T = sch.fcluster(Z, 0.5 * d.max(), 'distance')
    # array([4, 5, 3, 2, 2, 3, 5, 2, 2, 5, 2, 2, 2, 3, 2, 3, 2, 5, 4, 5, 2, 5, 2,
    #        3, 3, 3, 1, 3, 4, 2, 2, 4, 2, 4, 3, 3, 2, 5, 5, 5, 3, 2, 2, 2, 5, 4,
    #        2, 4, 2, 2, 5, 5, 1, 2, 3, 2, 2, 5, 4, 2, 5, 4, 3, 5, 4, 4, 2, 2, 2,
    #        4, 2, 5, 2, 2, 3, 3, 2, 4, 5, 3, 4, 4, 2, 1, 5, 4, 2, 2, 5, 5, 2, 2,
    #        5, 5, 5, 4, 3, 3, 2, 4], dtype=int32)

    sch.leaders(Z, T)
    # (array([190, 191, 182, 193, 194], dtype=int32),
    #  array([2, 3, 1, 4, 5], dtype=int32))

So the output of fcluster() gives the cluster assignment of each observation (by id), and leaders(), according to its documentation, is supposed to return two arrays:

  • the first contains the leader nodes of the clusters generated from Z; here there are 5 clusters, matching the plot

  • the second contains the ids of these clusters

So if leaders() returns L and M respectively, then L[2]=182 and M[2]=1, meaning cluster 1 is led by node id 182. That id doesn't exist in the observation set X; the documentation says "... then it corresponds to a non-singleton cluster", but I don't understand what that means.
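One way to read the leaders() output, as a minimal sketch: with n observations, ids 0..n-1 are the original points and id n+i is the cluster formed by row i of Z, so in a 100-point example 182 would be the cluster created at row 82. The code below relies on that numbering and uses to_tree(Z, rd=True) to list the observations under each leader. The fixed seed and the `members` dictionary are illustrative choices, not part of the original post.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, leaders, to_tree
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)          # fixed seed, only for reproducibility
X = rng.standard_normal((100, 2))
d = pdist(X)
Z = linkage(d, method='complete')
T = fcluster(Z, 0.5 * d.max(), 'distance')
L, M = leaders(Z, T)                    # L: leader node ids, M: cluster ids

n = X.shape[0]
# With rd=True, to_tree also returns a list where nodes[i] is the node with id i.
root, nodes = to_tree(Z, rd=True)

members = {}                            # cluster id -> sorted observation ids
for leader_id, cluster_id in zip(L, M):
    # pre_order() returns the ids of the leaves (observations) under a node.
    members[int(cluster_id)] = sorted(nodes[leader_id].pre_order())
    if leader_id >= n:
        # Non-singleton leader: it was created by row (leader_id - n) of Z.
        print(cluster_id, "-> Z row", leader_id - n,
              "size", int(Z[leader_id - n, 3]))
    else:
        print(cluster_id, "-> single observation", leader_id)
```

Each entry of `members` should match the observations that fcluster() labelled with the same cluster id.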

Also, I converted Z to a tree with sch.to_tree(Z), which returns an easy-to-use tree object that I would like to visualize. Which graphical tool accepts this kind of tree object as input?
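For the SIF part, a minimal sketch: SIF is a plain-text edge list with one "source relation target" line per edge, so a subtree can be written out by walking the ClusterNode objects returned by to_tree(). The helper name subtree_to_sif and the relation label "pc" (parent-child) are made up for illustration, and the four 1-D points are toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def subtree_to_sif(node, relation="pc"):
    """Return SIF lines ('parent relation child') for the subtree under node.

    `relation` is an arbitrary label chosen for this sketch.
    """
    lines = []
    stack = [node]
    while stack:
        cur = stack.pop()
        for child in (cur.get_left(), cur.get_right()):
            if child is not None:       # leaves have no children
                lines.append(f"{cur.get_id()} {relation} {child.get_id()}")
                stack.append(child)
    return lines

# Toy data: four 1-D observations, so the tree has ids 0..6.
X = np.array([[0.0], [1.0], [5.0], [7.0]])
Z = linkage(X, method='complete')
root = to_tree(Z)

sif_lines = subtree_to_sif(root)
print("\n".join(sif_lines))
```

To export a single cluster, pass the node of its leader (for example nodes[leader_id] from to_tree(Z, rd=True)) instead of the root. The resulting text can be saved to a .sif file, which tools such as Cytoscape can import.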

asked Jun 02 '13 13:06 by titan



1 Answer

Answering the part of your question regarding tree manipulation...

As explained in another answer, you can get the coordinates of the branches by reading icoord and dcoord from the dictionary returned by dendrogram() (P in your code). For each branch, the coordinates are given from left to right.
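To make the layout of those two arrays concrete, here is a small sketch on toy data (the four 1-D points are an assumption for illustration). Each link is drawn as an upside-down U, so it has four x values in icoord and four y values in dcoord; the two middle y values are the height at which the merge happens.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data: four 1-D observations give three links.
X = np.array([[0.0], [1.0], [5.0], [7.0]])
Z = linkage(X, method='complete')

# no_plot=True computes the coordinates without drawing anything.
P = dendrogram(Z, no_plot=True)

icoord = np.array(P['icoord'])   # shape (n - 1, 4): x coordinates of each link
dcoord = np.array(P['dcoord'])   # shape (n - 1, 4): y coordinates of each link

for xs, ys in zip(icoord, dcoord):
    # xs and ys run left to right: left foot, left top, right top, right foot.
    print(xs, ys)
```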

If you want to manually plot the tree you can use something like:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_tree(P, pos=None):
        """Redraw the dendrogram in P; pos optionally selects some branches."""
        plt.clf()
        icoord = np.array(P['icoord'])
        dcoord = np.array(P['dcoord'])
        color_list = np.array(P['color_list'])
        # Axis limits come from the full tree so partial plots keep the same scale.
        xmin, xmax = icoord.min(), icoord.max()
        ymin, ymax = dcoord.min(), dcoord.max()
        if pos is not None:
            icoord = icoord[pos]
            dcoord = dcoord[pos]
            color_list = color_list[pos]
        for xs, ys, color in zip(icoord, dcoord, color_list):
            plt.plot(xs, ys, color=color)
        plt.xlim(xmin - 10, xmax + 0.1 * abs(xmax))
        plt.ylim(ymin, ymax + 0.1 * abs(ymax))
        plt.show()

With your code, plot_tree(P) gives:

[image: output of plot_tree(P)]

The function allows you to select just some branches:

    plot_tree(P, range(10))

[image: output of plot_tree(P, range(10))]

Now you need to decide which branches to plot. The fcluster() output can be a little obscure, so another way to choose branches, based on a minimum and a maximum distance tolerance, is to use the output of linkage() directly (Z in the OP's case):

    import numpy as np

    dmin = 0.2
    dmax = 0.3
    # Rows of Z whose merge distance (column 2) lies in [dmin, dmax]
    pos = np.all((Z[:, 2] >= dmin, Z[:, 2] <= dmax), axis=0).nonzero()
    plot_tree(P, pos)
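One caveat, offered as a hedged note rather than a correction: the snippet above indexes P['icoord'] and P['dcoord'] with row positions taken from Z, which assumes both are in the same order. As far as I can tell, dendrogram() stores its links in traversal order, which need not match the row order of Z. A sketch that sidesteps the question is to filter on the link heights stored in dcoord itself (the second value of each entry is the merge distance). The seed and data below are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)           # fixed seed, only for reproducibility
X = rng.standard_normal((100, 2))
Z = linkage(X, method='complete')
P = dendrogram(Z, no_plot=True)

dcoord = np.array(P['dcoord'])
heights = dcoord[:, 1]                   # top of each link = its merge distance

dmin, dmax = 0.2, 0.3
# Positions within P['icoord'] / P['dcoord'] of the links in [dmin, dmax].
pos = np.flatnonzero((heights >= dmin) & (heights <= dmax))
print(len(pos), "links selected")
```

The resulting pos can be passed to plot_tree(P, pos) in the same way as before.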

Recommended references:

  • How does condensed distance matrix work? (pdist)
  • how to plot and annotate hierarchical clustering dendrograms in scipy/matplotlib
answered Oct 16 '22 02:10 by Saullo G. P. Castro