I was trying to develop a clustering algorithm that finds k classes in a set of 2D points (with k given as input), using Kruskal's algorithm lightly modified to find k spanning trees instead of one.
I compared my output to a proposed optimum (1) using the Rand index, which for k = 7 came out at 95.5%. The comparison can be seen at the link below.
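For reference, the Rand index itself can be computed like this (a minimal sketch assuming a recent scikit-learn; the label arrays are placeholders, not my data):

```python
# Compare a predicted clustering against a reference one with the Rand index.
# Assumes scikit-learn >= 0.24 (rand_score); the labels below are toy placeholders.
from sklearn.metrics import rand_score

labels_true = [0, 0, 1, 1, 2, 2]   # reference clustering (placeholder)
labels_pred = [0, 0, 1, 2, 2, 2]   # algorithm output (placeholder)

print(rand_score(labels_true, labels_pred))  # 0.8 for this toy example
```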
Problem:
The set has 5 clearly separated clusters that are easily classified by the algorithm, but the results are rather disappointing for k > 5, which is when things start to get tricky. I believe that my algorithm is correct, and that maybe the data is particularly bad for a Kruskal approach. Single-linkage agglomerative clustering methods, such as Kruskal's, are known to perform badly on some problems because they reduce the assessment of cluster quality to a single similarity between one pair of points.
The idea of the algorithm is very simple:
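Run standard Kruskal on the complete graph of pairwise distances, but stop merging once only k components remain. A minimal sketch, assuming a union-find formulation (illustrative names, not my actual code):

```python
# Kruskal-based clustering sketch: treat the points as a complete graph,
# process edges by increasing length, and stop once k components remain.
# Assumes `points` is a list of (x, y) tuples; names are illustrative only.
from itertools import combinations
from math import dist

def kruskal_clusters(points, k):
    parent = list(range(len(points)))

    def find(i):
        # union-find "find" with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # all pairwise edges, sorted by Euclidean length
    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))

    components = len(points)
    for u, v in edges:
        if components == k:
            break                      # stop once k spanning trees remain
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1

    return [find(i) for i in range(len(points))]   # component label per point
```

Calling `kruskal_clusters(points, 7)` then labels each point with the root of its component.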
Bottom line: Why is the algorithm failing like that? Is it Kruskal's fault? If so, why precisely? Any suggestions to improve the results without abandoning Kruskal?
(1): Gionis, A., H. Mannila, and P. Tsaparas. Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007. 1(1): p. 1-30.
The basic difference, I would say, is that given a set of nodes, Dijkstra's algorithm finds the shortest path between two nodes, which does not necessarily cover all the nodes in the graph. In Kruskal's case, however, the algorithm tries to cover all the nodes while keeping the total edge cost minimal.
To get the optimal number of clusters for hierarchical clustering, we make use of a dendrogram, which is a tree-like chart showing the sequence of merges or splits of clusters. When two clusters are merged, the dendrogram joins them, and the height of the join is the distance between those clusters.
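A minimal sketch of how such a dendrogram could be produced, assuming SciPy and Matplotlib are available (the point set here is a random placeholder):

```python
# Build a single-linkage dendrogram for a 2D point set (placeholder data).
# Assumes scipy and matplotlib; the cluster count can be read off by
# cutting the tree at a chosen height.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
points = rng.random((30, 2))              # stand-in for the real data set

Z = linkage(points, method="single")      # single linkage = Kruskal/MST merge order
dendrogram(Z)                             # join heights = inter-cluster distances
plt.show()
```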
According to the gap statistic method, k=12 is also determined to be the optimal number of clusters (Figure 13). We can visually compare the k-means clusters for k=9 (optimal according to the elbow method) and k=12 (optimal according to the silhouette and gap statistic methods) (see Figure 14).
This is known as the single-link effect.
Kruskal seems to be a semi-clever way of computing single-linkage clustering. The naive approach to hierarchical clustering is O(n^3), and the Kruskal approach should be O(n^2 log n) due to having to sort the n^2 edges.

Note that SLINK can do single-linkage clustering in O(n^2) runtime and O(n) memory.
Have you tried loading your data set into, e.g., ELKI and comparing your result to its single-link clustering?
To get better results, try other linkages (usually O(n^3) runtime) or density-based clustering such as DBSCAN (O(n^2) without an index, O(n log n) with an index). On this toy data set, epsilon=2 and minPts=5 should work well.
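If ELKI is inconvenient, the same check can be sketched with scikit-learn (assumed here; the file name is a placeholder for the question's data set):

```python
# DBSCAN with the suggested parameters: epsilon=2, minPts=5.
# Assumes scikit-learn; "points.txt" is a placeholder for the 2D data set.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.loadtxt("points.txt")            # placeholder: n x 2 array of coordinates
labels = DBSCAN(eps=2, min_samples=5).fit_predict(points)

print(np.unique(labels))                     # -1 marks noise points
```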
Bridges between clusters that should be separate are a classic example of Kruskal getting things wrong. You might try, for each point, overwriting the shortest distance from that point with the second shortest distance from that point; this might increase the lengths of the bridges without increasing other lengths.
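A rough sketch of that idea, assuming a dense NumPy distance matrix `D` as input (whether it actually helps on this data set is untested):

```python
# For each point, replace its shortest distance with its second shortest,
# so lone "bridge" edges between clusters get stretched before Kruskal runs.
# Assumes D is a symmetric (n x n) distance matrix with zeros on the diagonal.
import numpy as np

def stretch_bridges(D):
    off = D + np.diag(np.full(len(D), np.inf))   # mask self-distances
    two_smallest = np.sort(off, axis=1)[:, :2]   # nearest and second-nearest distance per row
    out = D.copy()
    for i, (_, d2) in enumerate(two_smallest):
        j = int(np.argmin(off[i]))               # index of i's nearest neighbour
        out[i, j] = d2                           # overwrite shortest with second shortest
    return np.maximum(out, out.T)                # keep one weight per undirected edge
```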
By eye, this looks like something K-means might do well - except for the top left, the clusters are nearly circular.
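For comparison, a quick K-means baseline could be sketched with scikit-learn (assumed here; the file name and k=7 are placeholders echoing the question):

```python
# K-means baseline on the same 2D point set (placeholder data loading).
# Assumes scikit-learn; k=7 mirrors the value used for the Rand index above.
import numpy as np
from sklearn.cluster import KMeans

points = np.loadtxt("points.txt")                 # placeholder: n x 2 coordinates
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(points)
```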