I have 1 million 5-dimensional points that I need to group into k clusters, with k << 1 million. Within each cluster, no two points should be too far apart (e.g., clusters could be bounding spheres of a specified radius). That means there will probably have to be many clusters of size 1.
But! I need the running time to be well below n^2; n log n or so would be fine. The reason I'm doing this clustering is to avoid computing the full distance matrix over all n points (which takes n^2 time, i.e. many hours); instead I want to compute distances only between clusters.
I tried the pycluster k-means algorithm but quickly realized it's way too slow. I've also tried the following greedy approach:
Slice space into 20 pieces along each dimension (so there are 20^5 gridboxes in total). I store clusters in these gridboxes, according to their centroids.
For each point, retrieve the gridboxes that are within r (the maximum bounding-sphere radius). If there is a near-enough cluster, add the point to it; otherwise make a new cluster.
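A minimal sketch of such a greedy pass (an illustration, not the exact implementation described; it assumes gridboxes of edge length r rather than a fixed 20 slices per dimension, so only the 3^5 surrounding boxes ever need scanning):

```python
import numpy as np
from collections import defaultdict
from itertools import product

def greedy_grid_cluster(points, r):
    """One-pass greedy clustering via a spatial hash grid.
    Each point joins the closest existing cluster whose centroid lies
    within r; otherwise it founds a new cluster. Gridboxes have edge
    length r, so only the 3^d boxes around a point can hold a match.
    Simplification: clusters are not re-binned when their centroid
    drifts into a neighboring box as members are added."""
    grid = defaultdict(list)                 # box coords -> cluster indices
    centroids, counts = [], []
    labels = np.empty(len(points), dtype=int)

    for i, p in enumerate(points):
        box = tuple((p // r).astype(int))
        best, best_d = -1, r
        # scan this gridbox and its 3^5 - 1 neighbors in 5-D
        for off in product((-1, 0, 1), repeat=points.shape[1]):
            nbox = tuple(b + o for b, o in zip(box, off))
            for c in grid.get(nbox, ()):
                d = np.linalg.norm(p - centroids[c])
                if d < best_d:
                    best, best_d = c, d
        if best == -1:                       # no cluster within r: start a new one
            best = len(centroids)
            centroids.append(p.astype(float))
            counts.append(1)
            grid[box].append(best)
        else:                                # running-mean centroid update
            counts[best] += 1
            centroids[best] += (p - centroids[best]) / counts[best]
        labels[i] = best
    return np.array(centroids), labels
```

The cost is roughly O(n · 3^5 · box occupancy), which stays far below n^2 as long as the boxes remain sparsely populated.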
However, this seems to give me more clusters than I want. I also implemented two similar variants of this approach, and they give very different answers.
Are there any standard approaches to clustering in faster than n^2 time? Probabilistic algorithms are ok.
If the clusters are well separated, then k-means is the fastest option. If the dataset is overlapping, then efficiency and effectiveness both matter, and fuzzy clustering methods (e.g., fuzzy c-means) are the recommended solutions.
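As a concrete illustration (not part of the original answer, and assuming scikit-learn is available), mini-batch k-means handles a million points easily, since each iteration touches only a batch rather than all n points:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

points = np.random.rand(1_000_000, 5)       # stand-in for the real data
mbk = MiniBatchKMeans(n_clusters=5000,      # k must be chosen up front
                      batch_size=10_000,
                      n_init=3,
                      random_state=0)
labels = mbk.fit_predict(points)
centroids = mbk.cluster_centers_            # usable for cluster-to-cluster distances
```

One caveat for this question: k-means puts no bound on cluster radius, so oversized clusters would have to be split afterwards to respect the bounding-sphere constraint.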
DBSCAN produces a varying number of clusters, based on the input data. Comparing the two: k-means is much faster than DBSCAN, but DBSCAN doesn't need the number of clusters specified in advance.
The main disadvantage of DBSCAN is that it is much more prone to noise, which may lead to false clusterings. HDBSCAN, on the other hand, focuses on high-density clusters, which reduces this noise problem and yields a hierarchical clustering based on a condensed cluster-tree approach.
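A hedged sketch (again assuming scikit-learn; eps and min_samples here are illustrative, not recommendations). With a tree index, DBSCAN in 5 dimensions runs in roughly O(n log n):

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.random.rand(100_000, 5)         # stand-in data
# min_samples=1 lets singleton clusters exist, matching the requirement above
db = DBSCAN(eps=0.05, min_samples=1, algorithm='kd_tree').fit(points)
labels = db.labels_
```

Note that DBSCAN merges density-connected chains, so a cluster's diameter can grow well beyond 2·eps; it bounds neighbor distances, not the bounding-sphere radius.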
Consider an approximate nearest neighbor (ANN) algorithm or locality sensitive hashing (LSH). They don't directly solve the clustering problem, but they will be able to tell you which points are "close" to one another. By altering the parameters, you can define close to be as close as you want. And it's fast.
More precisely, LSH can provide a hash function h such that, for two points x and y and a distance metric d,

d(x,y) <= R1  =>  P(h(x) = h(y)) >= P1
d(x,y) >= R2  =>  P(h(x) = h(y)) <= P2

where R1 < R2 and P1 > P2. So yes, it is probabilistic. You can postprocess the retrieved data to arrive at true clusters.
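For illustration only (this is the p-stable scheme that E2LSH implements; the parameter values are arbitrary), a Euclidean LSH table can be built from a few random projections:

```python
import numpy as np
from collections import defaultdict

def lsh_buckets(points, w=0.25, n_projections=4, seed=0):
    """Sketch of p-stable (Euclidean) LSH. Each hash coordinate is
    floor((a . x + b) / w) for a random Gaussian direction a and a
    random offset b; nearby points collide with high probability.
    w, n_projections, and the single table are illustrative choices."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    a = rng.standard_normal((d, n_projections))   # random directions
    b = rng.uniform(0, w, size=n_projections)     # random offsets
    keys = np.floor((points @ a + b) / w).astype(int)
    buckets = defaultdict(list)
    for i, key in enumerate(map(tuple, keys)):
        buckets[key].append(i)                    # candidate near-neighbor groups
    return buckets
```

Points sharing a bucket are only candidate neighbors; in practice you verify candidates with exact distances and union the results of several independent tables, which is the postprocessing step mentioned above.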
Here is information on LSH, including the E2LSH manual. ANN is similar in spirit; David Mount has information here, or try FLANN (which has Matlab and Python bindings).
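A usage sketch, assuming the pyflann Python bindings for FLANN (the parameter values are illustrative, not prescriptive):

```python
import numpy as np
from pyflann import FLANN

points = np.random.rand(1_000_000, 5).astype(np.float32)
flann = FLANN()
flann.build_index(points, algorithm='kdtree', trees=4)
# approximate 10-nearest-neighbor lists for every point, in one batch
neighbors, dists = flann.nn_index(points, num_neighbors=10)
```

The neighbor lists can then be swept greedily, merging points whose mutual distances stay within r, to form clusters that respect the bounding-sphere constraint.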