I am trying to implement the Canopy clustering algorithm along with K-Means. From what I've found online, you use Canopy clustering to get the initial starting points to feed into K-means. The problem is that Canopy clustering requires you to specify two threshold values for each canopy, T1 and T2: points within the inner threshold are strongly tied to that canopy, and points within the wider threshold are less tied to it. How are these thresholds, i.e. the distances from the canopy center, determined?
Problem context:
The problem I'm trying to solve is this: I have sets of about 50 numbers drawn from ranges such as [1, 30] or [1, 250]. A set can contain duplicate elements and floating-point values, e.g. 8, 17.5, 17.5, 23, 66, ... I want to find the optimal clusters, i.e. subsets of the set of numbers.
So, if Canopy clustering with K-means is a good choice, my question still stands: how do you find the T1 and T2 values? If it is not a good choice, is there a better, simpler but still effective algorithm to use?
Threshold clustering (TC) is a recently developed method designed so that, given a pre-specified threshold t*, each cluster contains at least t* units and the maximum within-cluster dissimilarity (MWCD) is small.
You can draw the points and the centers via matplotlib's scatter function, assigning colors according to the group computed by kmeans. Here is an example (the kmeans function now also returns the centroids), and an attempt to show the given target names together with the k-means approximation.
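A minimal, self-contained sketch of this idea; the `kmeans` helper and the sample blobs below are illustrative assumptions, not the code from the thread:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm that returns both labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centroids[None], axis=2), axis=1)
        new = np.array([points[labels == j].mean(axis=0)
                        if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# three illustrative 2-D blobs
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0.0, 3.0, 6.0)])
labels, centroids = kmeans(points, 3)

plt.scatter(points[:, 0], points[:, 1], c=labels)                   # color by cluster
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="x")  # centers
plt.savefig("clusters.png")
```

Passing the label array as `c=` is what gives each cluster its own color; the centroids are overplotted as red crosses.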
Perhaps naively, I see the problem in terms of a sort of spectral estimation. Suppose I have 10 vectors. I can compute the distances between all pairs, which gives 45 distances. Plot them as a histogram over various distance ranges: e.g. 10 distances fall between 0.1 and 0.2, 5 between 0.2 and 0.3, and so on, and you get an idea of how the distances between vectors are distributed. From this information you can choose T1 and T2 (e.g. choose them so that you cover the distance range that is the most populated).
Of course, this is not practical for a large dataset - but you could just take a random sample or something so that you at least know the ballpark of T1 and T2. Using something like Hadoop you could do some sort of prior spectral estimation on a large number of points. If all incoming data you are trying to cluster is distributed in much the same way, then you just need to get T1 and T2 once, then fix them as constants for all future runs.
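As a sketch of the histogram idea, assuming the data is a NumPy array of vectors; the function name, bin count, and the choice of the most populated bin's edges as (T1, T2) are my own assumptions:

```python
import numpy as np

def suggest_thresholds(points, bins=10, sample=None, seed=0):
    """Histogram the pairwise distances (optionally of a random sample)
    and return (T1, T2) as the edges of the most populated bin."""
    rng = np.random.default_rng(seed)
    if sample is not None and sample < len(points):
        points = points[rng.choice(len(points), sample, replace=False)]
    # all n*(n-1)/2 pairwise distances
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    dists = d[np.triu_indices(len(points), k=1)]
    counts, edges = np.histogram(dists, bins=bins)
    i = np.argmax(counts)            # most populated distance range
    t2, t1 = edges[i], edges[i + 1]  # T2 < T1: tight inner, loose outer
    return t1, t2

# e.g. 10 vectors -> 45 pairwise distances, as in the text
pts = np.random.default_rng(2).normal(size=(10, 3))
t1, t2 = suggest_thresholds(pts)
```

For a large dataset you would pass `sample=` so that only a random subset's pairwise distances are computed, as suggested above.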
Actually, that is the big issue with Canopy clustering: choosing the thresholds is pretty much as difficult as the actual algorithm, in particular in high dimensions. For a 2D geographic dataset, a domain expert can probably define the distance thresholds easily. But for high-dimensional data, probably the best you can do is to run k-means on a sample of your data first, then choose the distances based on this sample run.
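A hedged sketch of that sample-based approach; the particular heuristics (mean within-cluster radius for T2, half the mean centroid gap for T1) are my own assumptions, not a standard recipe:

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm returning labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centroids[None], axis=2), axis=1)
        centroids = np.array([points[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

def thresholds_from_sample(points, k=3, sample=200, seed=0):
    """Run k-means on a random sample and derive canopy thresholds from it."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), min(sample, len(points)), replace=False)
    s = points[idx]
    labels, cents = kmeans(s, k, seed=seed)
    # T2 (inner, tight): typical within-cluster radius on the sample
    t2 = np.mean([np.linalg.norm(s[labels == j] - cents[j], axis=1).mean()
                  for j in range(k) if np.any(labels == j)])
    # T1 (outer, loose): half the mean distance between centroids
    gaps = np.linalg.norm(cents[:, None] - cents[None], axis=2)
    t1 = gaps[np.triu_indices(k, 1)].mean() / 2
    return t1, t2

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(c, 0.3, (100, 2)) for c in (0.0, 5.0, 10.0)])
t1, t2 = thresholds_from_sample(data, k=3)
```

The point is only that a cheap k-means run on a sample gives you the scale of the data, from which ballpark T1 and T2 values follow.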