 

Why does Kruskal clustering generate suboptimal classes?

I was trying to develop a clustering algorithm to find k classes in a set of 2D points (with k given as input), using the Kruskal algorithm lightly modified to find k spanning trees instead of one.

I compared my output to a proposed optimum (1) using the Rand index, which for k = 7 came out at 95.5%. The comparison can be seen in the image below.
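For reference, the Rand index for that comparison is just the fraction of point pairs on which the two labelings agree; here is a minimal sketch in plain Python (the label lists are placeholders for my output and the reference clustering):

```python
# Rand index: fraction of point pairs that both labelings treat the same way
# (same cluster in both, or different clusters in both).
from itertools import combinations

def rand_index(labels_a, labels_b):
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# e.g. rand_index(my_labels, reference_labels) gave ~0.955 for k = 7
```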

Problem:

The set has 5 clearly separated clusters that are easily classified by the algorithm, but the results are rather disappointing for k > 5, which is when things start to get tricky. I believe that my algorithm is correct; maybe the data is just particularly bad for a Kruskal approach. Single-linkage agglomerative clustering, which is what Kruskal's approach amounts to, is known to perform badly on some problems, since it reduces the assessment of cluster quality to a single similarity between one pair of points.

The idea of the algorithm is very simple:

  • Make a complete graph from the data set, with the weight of each edge being the Euclidean distance between its endpoints.
  • Sort the edge list by weight.
  • For each edge (in order), add it to the spanning forest if it doesn't form a cycle. Stop when all the edges have been traversed or when the remaining forest has k trees.
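
A minimal sketch of this procedure, using a union-find forest (the function and variable names are mine, not the actual implementation):

```python
# Kruskal-based k-clustering: stop merging once only k trees remain.
import math
from itertools import combinations

def kruskal_k_clusters(points, k):
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Complete graph: every pair of points, weighted by Euclidean distance,
    # sorted by weight.
    edges = sorted(
        combinations(range(n), 2),
        key=lambda e: math.dist(points[e[0]], points[e[1]]),
    )

    trees = n
    for i, j in edges:
        if trees == k:                      # k trees left: stop
            break
        ri, rj = find(i), find(j)
        if ri != rj:                        # edge does not close a cycle
            parent[ri] = rj
            trees -= 1

    # Label each point by the tree it ends up in.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

Calling `kruskal_k_clusters(points, 7)` on the data set returns one cluster id per point.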

[Figure: comparison of my clustering and the proposed optimum for k = 7]

Bottom line: Why is the algorithm failing like that? Is it Kruskal's fault? If so, why, precisely? Any suggestions to improve the results without abandoning Kruskal?

(1): Gionis, A., Mannila, H., and Tsaparas, P. Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007, 1(1): 1-30.

asked Dec 05 '13 by rgcalsaverini



2 Answers

This is known as the single-link effect.

Kruskal seems to be a semi-clever way of computing single-linkage clustering. The naive approach for "hierarchical clustering" is O(n^3), and the Kruskal approach should be O(n^2 log n) due to having to sort the n^2 edges.

Note that SLINK can do single-linkage clustering in O(n^2) runtime and O(n) memory.

Have you tried loading your data set into e.g. ELKI and comparing your result to its single-link clustering?
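
If ELKI is inconvenient, a quick cross-check is also possible with SciPy's single-linkage implementation (SciPy is an assumption here; `points` is a stand-in for your (n, 2) data):

```python
# Single-linkage cross-check via SciPy (an assumption; the suggestion above is ELKI).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.random.rand(100, 2)                  # placeholder for the 2D data set
Z = linkage(points, method="single")             # single-linkage merge tree
labels = fcluster(Z, t=7, criterion="maxclust")  # cut the tree into k = 7 clusters
```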

To get better results, try other linkages (usually O(n^3) runtime) or density-based clustering such as DBSCAN (O(n^2) without an index, O(n log n) with an index). On this toy data set, epsilon=2 and minPts=5 should work well.
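
Those parameters map directly onto, e.g., scikit-learn's DBSCAN (scikit-learn is an assumption; epsilon and minPts become `eps` and `min_samples`):

```python
# DBSCAN with the parameters suggested above; label -1 marks noise points.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.random.rand(100, 2)    # placeholder for the 2D data set
labels = DBSCAN(eps=2, min_samples=5).fit_predict(points)
```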

answered Sep 28 '22 by Has QUIT--Anony-Mousse


Bridges between clusters that should remain separate are a classic example of Kruskal getting things wrong. You might try, for each point, overwriting the shortest distance from that point with the second shortest distance from that point; this might increase the lengths of the bridge edges without increasing other lengths.
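
A rough sketch of that tweak, as I read it, applied to a precomputed distance matrix before running Kruskal (the re-symmetrisation step is my own assumption):

```python
# For each point, raise the distance to its nearest neighbour to the distance
# of its second-nearest neighbour; one interpretation of the tweak above.
import numpy as np

def raise_shortest_edges(dist):
    d = dist.astype(float)                   # copy of the symmetric distance matrix
    np.fill_diagonal(d, np.inf)              # ignore self-distances
    order = np.argsort(d, axis=1)            # neighbours sorted by distance, per row
    for i in range(len(d)):
        nearest, second = order[i, 0], order[i, 1]
        d[i, nearest] = d[i, second]         # shortest -> second shortest
    np.fill_diagonal(d, 0.0)
    return np.maximum(d, d.T)                # re-symmetrise, keeping the raised value
```

Running the same Kruskal procedure on the modified matrix should then leave the bridge edges until last.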

By eye, this looks like something K-means might do well - except for the top left, the clusters are nearly circular.

answered Sep 28 '22 by mcdowella