A dendrogram is a data structure used with hierarchical clustering algorithms that groups clusters at different "heights" of a tree - where the heights correspond to distance measures between clusters.
After a dendrogram is created from some input data set, it's often a further challenge to figure out where to "cut" the dendrogram, meaning selecting a height such that only clusters below that height are considered meaningful.
It's not always clear at what height to cut a dendrogram, but some algorithms exist, such as the DynamicTreeCut algorithm, that attempt to programmatically select meaningful clusters from the dendrogram.
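To make the "cutting" idea concrete, here is a minimal Python sketch of building a dendrogram and cutting it at a fixed height. SciPy is my choice of library for illustration; the question below is about DynamicTreeCut specifically, but the cut-at-a-height idea is the same:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Two well-separated blobs in 2-D, so the dendrogram has obvious structure.
    rng = np.random.default_rng(0)
    points = np.vstack([rng.normal(0, 0.3, (5, 2)),
                        rng.normal(3, 0.3, (5, 2))])

    # Build the dendrogram; column 2 of the linkage matrix holds the merge heights.
    Z = linkage(points, method="average")

    # "Cut" at height 1.0: merges above that height are ignored, and every
    # subtree hanging below the cut becomes one cluster.
    labels = fcluster(Z, t=1.0, criterion="distance")
    print(labels)  # two clusters of five points each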
See:
https://stats.stackexchange.com/questions/3685/where-to-cut-a-dendrogram
"Cutting dendrogram at highest level of purity"
So I've been reading over the DynamicTreeCut algorithm, as well as a Java implementation of it. I understand how the algorithm works, in the sense of a step-by-step breakdown of what's going on. However, I am failing to understand how this algorithm is doing anything meaningful. I think I am missing some key concept here.
In general, the algorithm iterates over a "height sequence" from a dendrogram. I'm not certain, but I assume that a "height sequence" just means the values along the Y axis of a dendrogram, i.e. the various heights at which cluster joins occur. If that's the case, we can assume that a "height sequence" is sorted in ascending order.
Then the algorithm calls for taking a "reference height", l, and subtracting it from each height in the input "height sequence". This gives you a vector of differences, D, between each height h[i] in the height sequence and the reference height l.
Then the algorithm attempts to find "transition points": points in the differences vector where D[i] > 0 and D[i+1] < 0. In other words, points in the differences vector where the difference value transitions from being positive to being negative.
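In code, the step just described would look something like this (a sketch in Python; H, l, and find_transitions are my own illustrative names, not taken from the paper or the Java implementation):

    def find_transitions(H, l):
        """Return indices i where D[i] > 0 and D[i+1] < 0, with D[i] = H[i] - l."""
        D = [h - l for h in H]
        return [i for i in range(len(D) - 1) if D[i] > 0 and D[i + 1] < 0]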
And right here I am totally lost. I don't understand how these transition points can ever be meaningful. Firstly, it's my understanding that the input height sequence H is just the values on the Y axis of the dendrogram, so the height sequence H should be in ascending order. Therefore, how can there ever be a point in the differences vector where we transition from positive to negative?
For example:
Suppose our input height sequence H is {1, 1.5, 2, 2.5, 3, 7, 9} and our reference value l is the average height (roughly 3.7). So if we create a difference vector D by subtracting l from each height in H, we'd get {-2.7, -2.2, -1.7, -1.2, -0.7, 3.3, 5.3}. So clearly there is no transition point here, nor could there ever be, because there are no points in the difference vector where D[i] > 0 and D[i+1] < 0, since the height sequence H is in ascending order.
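Running that transition search on this example (same numbers, with l rounded to 3.7 as above) confirms the arithmetic:

    H = [1, 1.5, 2, 2.5, 3, 7, 9]
    l = 3.7  # the average height, rounded as above
    D = [round(h - l, 2) for h in H]
    print(D)  # [-2.7, -2.2, -1.7, -1.2, -0.7, 3.3, 5.3]
    # A transition needs D[i] > 0 and D[i+1] < 0 -- impossible for ascending H.
    print([i for i in range(len(D) - 1) if D[i] > 0 and D[i + 1] < 0])  # []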
So clearly I am totally misunderstanding something fundamental about this algorithm. Perhaps I am not understanding what is meant by the "height sequence". I am assuming it is simply the values on the Y-axis from the dendrogram, but clearly that doesn't make any sense with regard to what the algorithm is actually doing. Unfortunately, the authors don't really explain what is meant by a "dendrogram height sequence", nor does it appear to be some kind of standard terminology in use by the data science community.
So can someone explain what the DynamicTreeCut algorithm is trying to achieve here, and where my understanding is going wrong?
For reference, the Dynamic Tree Cut authors describe the method as follows: "We present the Dynamic Tree Cut R library that implements novel dynamic branch cutting methods for detecting clusters in a dendrogram depending on their shape. [...] Dynamic tree cut is a top-down algorithm that relies solely on the dendrogram. The algorithm implements an adaptive, iterative process of cluster decomposition and combination and stops when the number of clusters becomes stable. Dynamic hybrid cut is a bottom-up algorithm that improves the detection of outlying members of each cluster."
I fully agree with your understanding of the paper and the algorithm. I reached the same conclusion as you did.
What I am offering is speculation, but I feel pretty confident about it.
The immediate conclusion is that you cannot give H as the sorted list of heights. Instead, it should be the heights in the order in which the merges appear when the leaves are laid out for visualization, i.e. "such that the lines in the resulting plot do not cross".
In the example, H would become (0.70, 0.35, 0.90, 0.45, 0.77, 0.40, 0.30, 0.55, 0.05). To clarify, the first entry 0.70 is the height of the merging line between the 1st and the 2nd leaf (labeled 3 and 7 here). Note that the visualization is not unique, but the result of the algorithm in the end will be.
Conclusion: because inside a cluster the merging heights are low, and a cluster should have its H - l values positive, you want a big pack of low merging heights (i.e. a cluster) to stand above the 0-line. So instead of using the merging heights, use their negatives.
In the example, H would become (-0.70, -0.35, -0.90, ...).
Let's try my assumption and have l = -0.75
H-l becomes (0.05, 0.40, -0.15, 0.30, -0.02, 0.35, 0.45, 0.20, 0.70)
And you have two transitions that define 3 clusters: (3-7-6), (1-4) and (9-5-8-10-2), which feels really reasonable. See the picture for reference.
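For anyone who wants to check the numbers, here is a quick sketch reproducing this; the H values and the leaf order 3, 7, 6, 1, 4, 9, 5, 8, 10, 2 are read off the example picture:

    H = [0.70, 0.35, 0.90, 0.45, 0.77, 0.40, 0.30, 0.55, 0.05]
    l = -0.75
    # Negate the plot-order merge heights, then subtract l.
    D = [round(-h - l, 2) for h in H]
    print(D)  # [0.05, 0.4, -0.15, 0.3, -0.02, 0.35, 0.45, 0.2, 0.7]
    # Positive-to-negative transitions:
    print([i for i in range(len(D) - 1) if D[i] > 0 and D[i + 1] < 0])  # [1, 3]
    # Those two transitions split the leaf order 3,7,6,1,4,9,5,8,10,2 into the
    # contiguous positive runs (3-7-6), (1-4) and (9-5-8-10-2).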
What I can also conclude is that this was a really roundabout way of saying that it is a fixed-height branch cut. Note that this is only TreeCutCore, so all the dynamic part is left to be done. Honestly, it's not that complex once you realize they just make recursive calls to TreeCutCore with varying heights on smaller and smaller clusters.
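A rough sketch of that recursive structure, under my reading; tree_cut_core below is just the fixed-height cut on negated heights, and the way l shrinks per level (multiplying by 0.9) is a placeholder I made up, not the paper's actual schedule:

    def tree_cut_core(H, l):
        """Fixed-height cut: contiguous runs where -H[i] - l > 0 form clusters."""
        clusters, current = [], []
        for i, h in enumerate(H):
            if -h - l > 0:
                current.append(i)
            elif current:
                clusters.append(current)
                current = []
        if current:
            clusters.append(current)
        return clusters

    def dynamic_cut(H, l, min_size=3):
        """Re-cut each large cluster recursively with a raised reference height."""
        result = []
        for c in tree_cut_core(H, l):
            if len(c) <= min_size:
                result.append(c)  # small enough, keep as-is
            else:
                sub = dynamic_cut([H[i] for i in c], l * 0.9, min_size)
                # Map sub-cluster indices back to the parent's positions; if the
                # cluster no longer splits, keep it whole.
                result.extend([[c[j] for j in s] for s in sub] or [c])
        return result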
Also, as further reassurance that I am not completely wrong: when you have several negative values one after the other, they correspond to singletons that are outliers, which is exactly what Node 54 (in Figure 5 of the paper you linked) is. Several contiguous negative values don't form a cluster themselves; they are singletons really dissimilar from each other. Only contiguous groups above 0 form a cluster.