A dendrogram is a data structure used with hierarchical clustering algorithms that groups clusters at different "heights" of a tree - where the heights correspond to distance measures between clusters.
After a dendrogram is created from some input data set, it's often a further challenge to figure out where to "cut" the dendrogram, meaning selecting a height such that only clusters below that height are considered meaningful.
It's not always clear at what height to cut a dendrogram, but some algorithms exist, such as the DynamicTreeCut algorithm, that attempt to programmatically select meaningful clusters from the dendrogram.
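To make the "cutting" idea concrete, here is a minimal Python sketch of building a dendrogram and cutting it at a fixed height. SciPy is my choice of library for illustration; the question below is about DynamicTreeCut specifically, but the cut-at-a-height idea is the same:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Two well-separated blobs in 2-D, so the dendrogram has obvious structure.
    rng = np.random.default_rng(0)
    points = np.vstack([rng.normal(0, 0.3, (5, 2)),
                        rng.normal(3, 0.3, (5, 2))])

    # Build the dendrogram; column 2 of the linkage matrix holds the merge heights.
    Z = linkage(points, method="average")

    # "Cut" at height 1.0: merges above that height are ignored, and every
    # subtree hanging below the cut becomes one cluster.
    labels = fcluster(Z, t=1.0, criterion="distance")
    print(labels)  # two clusters of five points each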
See:
https://stats.stackexchange.com/questions/3685/where-to-cut-a-dendrogram
"Cutting dendrogram at highest level of purity"
So I've been reading over the DynamicTreeCut algorithm, as well as a Java implementation of it. I understand how the algorithm works, in the sense of a step-by-step breakdown of what's going on. However, I am failing to understand how this algorithm is doing anything meaningful. I think I am missing some key concept here.
In general, the algorithm iterates over a "height sequence" from a dendrogram. I'm not certain, but I assume that a "height sequence" just means the values along the Y axis of a dendrogram, i.e. the various heights at which cluster joins occur. If that's the case, we can assume that a "height sequence" is sorted in ascending order.
Then the algorithm calls for taking a "reference height", l, and subtracting it from each height in the input "height sequence". This gives you a vector of differences, D, between each height h[i] in the height sequence and the reference height l.
Then the algorithm attempts to find "transition points": points in the differences vector where D[i] > 0 and D[i+1] < 0. In other words, points in the differences vector where the difference value transitions from being positive to being negative.
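In code, the step just described would look something like this (a sketch in Python; H, l, and find_transitions are my own illustrative names, not taken from the paper or the Java implementation):

    def find_transitions(H, l):
        """Return indices i where D[i] > 0 and D[i+1] < 0, with D[i] = H[i] - l."""
        D = [h - l for h in H]
        return [i for i in range(len(D) - 1) if D[i] > 0 and D[i + 1] < 0]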
And right here I am totally lost. I don't understand how these transition points can ever be meaningful. Firstly, it's my understanding that the input height sequence H is just the values on the Y axis of the dendrogram, so the height sequence H should be in ascending order. Therefore, how can there ever be a point in the differences vector where we transition from positive to negative?
For example:
Suppose our input height sequence H is {1, 1.5, 2, 2.5, 3, 7, 9} and our reference value l is the average height (roughly 3.7). So if we create a difference vector D by subtracting l from each height in H, we'd get {-2.7, -2.2, -1.7, -1.2, -0.7, 3.3, 5.3}. So clearly there is no transition point here, nor could there ever be, because there are no points in the difference vector where D[i] > 0 and D[i+1] < 0, since the height sequence H is in ascending order.
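Running that transition search on this example (same numbers, with l rounded to 3.7 as above) confirms the arithmetic:

    H = [1, 1.5, 2, 2.5, 3, 7, 9]
    l = 3.7  # the average height, rounded as above
    D = [round(h - l, 2) for h in H]
    print(D)  # [-2.7, -2.2, -1.7, -1.2, -0.7, 3.3, 5.3]
    # A transition needs D[i] > 0 and D[i+1] < 0 -- impossible for ascending H.
    print([i for i in range(len(D) - 1) if D[i] > 0 and D[i + 1] < 0])  # []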
So clearly I am totally misunderstanding something fundamental about this algorithm. Perhaps I am not understanding what is meant by the "height sequence". I am assuming it is simply the values on the Y-axis from the dendrogram, but clearly that doesn't make any sense with regard to what the algorithm is actually doing. Unfortunately, the authors don't really explain what is meant by a "dendrogram height sequence", nor does it appear to be some kind of standard terminology in use by the data science community.
So can someone explain what the DynamicTreeCut algorithm is trying to achieve here, and where my understanding is going wrong?
For reference, the Dynamic Tree Cut authors describe the method as follows: "We present the Dynamic Tree Cut R library that implements novel dynamic branch cutting methods for detecting clusters in a dendrogram depending on their shape. [...] Dynamic tree cut is a top-down algorithm that relies solely on the dendrogram. The algorithm implements an adaptive, iterative process of cluster decomposition and combination and stops when the number of clusters becomes stable. Dynamic hybrid cut is a bottom-up algorithm that improves the detection of outlying members of each cluster."
I fully agree with your understanding of the paper and the algorithm. I reached the same conclusion as you did.
What I am offering is speculation, but I feel pretty confident about it.
The immediate conclusion is that you cannot give H as the sorted list of heights. Instead, it should be the heights in the order in which the merges appear when the leaves are laid out for visualization, i.e. "such that the lines in the resulting plot do not cross".
In the example, H would become (0.70, 0.35, 0.90, 0.45, 0.77, 0.40, 0.30, 0.55, 0.05). To clarify, the first entry 0.70 is the height of the merging line between the 1st and the 2nd leaf (labeled 3 and 7 here). Note that the visualization is not unique, but the result of the algorithm in the end will be.
Conclusion: because inside a cluster the merging heights are low, and a cluster should have its H - l values positive, you want a big pack of low merging heights (i.e. a cluster) to stand above the 0-line. So instead of using the merging heights, use their negatives.
In the example, H would become (-0.70, -0.35, -0.90, ...).
Let's try my assumption and have l = -0.75
H-l becomes (0.05, 0.40, -0.15, 0.30, -0.02, 0.35, 0.45, 0.20, 0.70)
And you have two transitions that define 3 clusters: (3-7-6), (1-4) and (9-5-8-10-2), which feels really reasonable. See the picture for reference.
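For anyone who wants to check the numbers, here is a quick sketch reproducing this; the H values and the leaf order 3, 7, 6, 1, 4, 9, 5, 8, 10, 2 are read off the example picture:

    H = [0.70, 0.35, 0.90, 0.45, 0.77, 0.40, 0.30, 0.55, 0.05]
    l = -0.75
    # Negate the plot-order merge heights, then subtract l.
    D = [round(-h - l, 2) for h in H]
    print(D)  # [0.05, 0.4, -0.15, 0.3, -0.02, 0.35, 0.45, 0.2, 0.7]
    # Positive-to-negative transitions:
    print([i for i in range(len(D) - 1) if D[i] > 0 and D[i + 1] < 0])  # [1, 3]
    # Those two transitions split the leaf order 3,7,6,1,4,9,5,8,10,2 into the
    # contiguous positive runs (3-7-6), (1-4) and (9-5-8-10-2).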
What I can also conclude is that this was a really roundabout way of saying that it is a fixed-height branch cut. Note that this is only TreeCutCore, so all the dynamic part is left to be done. Honestly, it's not that complex once you realize they just make recursive calls to TreeCutCore with varying heights on smaller and smaller clusters.
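A rough sketch of that recursive structure, under my reading; tree_cut_core below is just the fixed-height cut on negated heights, and the way l shrinks per level (multiplying by 0.9) is a placeholder I made up, not the paper's actual schedule:

    def tree_cut_core(H, l):
        """Fixed-height cut: contiguous runs where -H[i] - l > 0 form clusters."""
        clusters, current = [], []
        for i, h in enumerate(H):
            if -h - l > 0:
                current.append(i)
            elif current:
                clusters.append(current)
                current = []
        if current:
            clusters.append(current)
        return clusters

    def dynamic_cut(H, l, min_size=3):
        """Re-cut each large cluster recursively with a raised reference height."""
        result = []
        for c in tree_cut_core(H, l):
            if len(c) <= min_size:
                result.append(c)  # small enough, keep as-is
            else:
                sub = dynamic_cut([H[i] for i in c], l * 0.9, min_size)
                # Map sub-cluster indices back to the parent's positions; if the
                # cluster no longer splits, keep it whole.
                result.extend([[c[j] for j in s] for s in sub] or [c])
        return result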
Also, as further reassurance that I am not completely wrong: when you have several negative values one after the other, they correspond to singletons that are outliers, which is exactly what Node 54 (in Figure 5 of the paper you linked) is. Several contiguous negative values don't form a cluster themselves; they are singletons really dissimilar from each other. Only contiguous groups above 0 form a cluster.