
Fast (< n^2) clustering algorithm

I have 1 million 5-dimensional points that I need to group into k clusters, with k << 1 million. Within each cluster, no two points should be too far apart (e.g. each cluster could be a bounding sphere with a specified radius). That means there will probably have to be many clusters of size 1.

But! I need the running time to be well below n^2; n log n or so would be fine. The reason I'm doing this clustering is to avoid computing the full distance matrix of all n points (which takes n^2 time, i.e. many hours); instead I want to compute distances only between clusters.

I tried the Pycluster k-means algorithm but quickly realized it's way too slow. I've also tried the following greedy approach (a rough code sketch follows the list):

  1. Slice space into 20 pieces along each dimension, giving 20^5 grid boxes in total. Store clusters in these grid boxes, keyed by their centroids.

  2. For each point, retrieve the grid boxes that are within r (the maximum bounding sphere radius) of it. If there is a near enough cluster, add the point to that cluster; otherwise start a new cluster.
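
Here is a minimal sketch of that greedy pass as I understand it (my reconstruction, not the asker's code), assuming points is an (n, 5) NumPy array and r is the bounding-sphere radius. With cell edge r, any centroid within distance r of a point lies in one of the 3^5 cells surrounding the point's cell:

    import numpy as np

    def greedy_grid_cluster(points, r):
        # One pass: assign each point to the nearest existing centroid
        # within r, else open a new cluster. Candidate clusters come from
        # a uniform grid with cell edge r, so only the 3**d neighboring
        # cells of a point's cell ever need to be scanned.
        d = points.shape[1]
        grid = {}                      # cell index -> list of cluster ids
        centroids, sizes = [], []

        for p in points:
            key = tuple(np.floor(p / r).astype(int))
            best, best_dist = None, r
            for offset in np.ndindex(*(3,) * d):
                nkey = tuple(k + o - 1 for k, o in zip(key, offset))
                for cid in grid.get(nkey, ()):
                    dist = np.linalg.norm(p - centroids[cid])
                    if dist <= best_dist:
                        best, best_dist = cid, dist
            if best is None:
                grid.setdefault(key, []).append(len(centroids))
                centroids.append(p.astype(float))
                sizes.append(1)
            else:
                # running-mean centroid update; the cluster stays filed in
                # its original cell even if the centroid drifts a little
                sizes[best] += 1
                centroids[best] += (p - centroids[best]) / sizes[best]
        return np.array(centroids), sizes

Note that the result depends on the order in which points are visited, which is one reason two runs of such an approach can disagree.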

However, this seems to give me more clusters than I want. I've also implemented two variants of this approach, and they give very different answers.

Are there any standard approaches to clustering in faster than n^2 time? Probabilistic algorithms are ok.

John Hawksley, asked Dec 09 '10



1 Answer

Consider an approximate nearest neighbor (ANN) algorithm or locality sensitive hashing (LSH). They don't directly solve the clustering problem, but they will be able to tell you which points are "close" to one another. By altering the parameters, you can define close to be as close as you want. And it's fast.

More precisely, LSH can provide a hash function h such that, for two points x and y and a distance metric d,

    d(x,y) <= R1  =>  P(h(x) = h(y)) >= P1
    d(x,y) >= R2  =>  P(h(x) = h(y)) <= P2

where R1 < R2 and P1 > P2. So yes, it is probabilistic. You can postprocess the retrieved data to arrive at true clusters.
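
As an illustrative sketch (not part of the original answer), here is the standard p-stable LSH scheme for Euclidean distance, where each hash is a tuple of quantized random projections. The table count, projection count, and bucket width w below are made-up example values you would tune to your radius:

    import numpy as np

    def make_lsh(dim, n_proj=12, n_tables=8, w=2.0, seed=0):
        # p-stable LSH for Euclidean distance: h(x) = floor((a.x + b) / w)
        # with a ~ Gaussian and b ~ Uniform[0, w). Points that hash to the
        # same tuple in any table are candidate near neighbors.
        rng = np.random.default_rng(seed)
        tables = [(rng.normal(size=(n_proj, dim)),
                   rng.uniform(0, w, size=n_proj))
                  for _ in range(n_tables)]

        def hash_point(x):
            return [tuple(np.floor((A @ x + b) / w).astype(int))
                    for A, b in tables]
        return hash_point

    # Bucket all points in O(n); only points sharing a bucket are compared.
    # (Vectorize the loop for real 1M-point workloads.)
    points = np.random.rand(100_000, 5)   # stand-in for the real data
    hash_point = make_lsh(dim=5)
    buckets = {}
    for i, x in enumerate(points):
        for t, key in enumerate(hash_point(x)):
            buckets.setdefault((t, key), []).append(i)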

The E2LSH manual is a good reference on LSH. ANN is similar in spirit; see David Mount's ANN library, or try FLANN (which has MATLAB and Python bindings).
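
One concrete postprocessing route, not from the original answer: in only 5 dimensions an exact spatial index already retrieves all near pairs in roughly n log n time, so you can use SciPy's cKDTree plus union-find to merge points within r of each other. This yields single-linkage components, which bound distances only through chains of close points, so check cluster diameters afterward if you need strict bounding spheres:

    import numpy as np
    from scipy.spatial import cKDTree

    def cluster_by_radius(points, r):
        # Retrieve all pairs within distance r (fast in low dimensions),
        # then union-find them into connected components.
        tree = cKDTree(points)
        pairs = tree.query_pairs(r)        # set of (i, j) pairs with i < j

        parent = list(range(len(points)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i
        for i, j in pairs:
            parent[find(i)] = find(j)

        # points with equal labels belong to the same cluster
        return np.array([find(i) for i in range(len(points))])

    points = np.random.rand(10_000, 5)
    labels = cluster_by_radius(points, r=0.05)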

Steve Tjoa, answered Oct 11 '22