Does anyone have a paper that explains how the Ckmeans.1d.dp algorithm works?
Or: what is the most optimal way to do k-means clustering in one-dimension?
On the one hand, it is notoriously difficult to define a distance between data points in high-dimensional scRNAseq space due to the Curse of Dimensionality; one the other hand, clustering algorithms often use idealistic assumptions which do not hold for the real world data.
Limitations of Cluster AnalysisThe different methods of clustering usually give very different results. This occurs because of the different criterion for merging clusters (including cases). It is important to think carefully about which method is best for what you are interested in looking at.
A cluster is defined and characterized based on the attributes present in the cluster. Clustering High-Dimensional Data we need to search for clusters and find out the space for the existing clusters. The High-Dimensional data is reduced to low-dimension data to make the clustering and search for clusters simple.
Univariate k-means clustering can be solved in O(kn) time (on already sorted input) based on theoretical results on Monge matrices, but the approach was not popular most likely due to numerical instability and also perhaps coding challenges.
A better option is an O(knlgn) method that is now implemented in Ckmeans.1d.dp version 3.4.6. This implementation is as fast as heuristic k-means but offers guaranteed optimality, orders of magnitude better than heuristic k-means especially for large k's.
The generic dynamic programming solution by Richard Bellman (1973) does not touch upon specifics of the k-means problem and the implied runtime is O(kn^3).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With