I have a list of opaque objects. I am only able to calculate the distance between them (not literally true; I'm just setting the conditions for the problem):
class Thing {
    public double DistanceTo(Thing other);
}
I would like to cluster these objects. I would like to control the number of clusters and I would like for "close" objects to be in the same cluster:
List<Cluster> cluster(int numClusters, List<Thing> things);
Can anyone suggest (and link to ;-)) some clustering algorithms (the simpler, the better!) or libraries that can help me?
Clarification: Most clustering algorithms require that the objects be laid out in some N-dimensional space. This space is used to find "centroids" of clusters. In my case, I do not know what N is, nor do I know how to extract a coordinate system from the objects. All I know is how far apart 2 objects are. I would like to find a good clustering algorithm that uses only that information.
Imagine that you are clustering based upon the "smell" of an object. You don't know how to lay "smells" out on a 2D plane, but you do know whether two smells are similar or not.
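To make that concrete: the most I can do is precompute every pairwise distance, something like the following (just an illustration; the helper class is made up, and I'm assuming DistanceTo is symmetric).

using System.Collections.Generic;

static class DistanceMatrixExample
{
    // Precompute all pairwise distances once; DistanceTo is assumed to be symmetric.
    public static double[,] Build(List<Thing> things)
    {
        int n = things.Count;
        var dist = new double[n, n];
        for (int i = 0; i < n; i++)
        {
            for (int j = i + 1; j < n; j++)
            {
                double d = things[i].DistanceTo(things[j]);
                dist[i, j] = d;
                dist[j, i] = d;
            }
        }
        return dist;
    }
}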
I think you are looking for K-Medoids. It's like K-means in that you specify the number of clusters, K, in advance, but it doesn't require that you have a notion of "averaging" the objects you're clustering like K-means does.
Instead, every cluster has a representative medoid: the member of the cluster with the smallest total distance to the other members, i.e. the one closest to the "middle". You could think of it as a version of K-means that finds "medians" instead of "means". All you need is a distance metric to cluster things, and I've used this in some of my own work for exactly the same reasons you cite.
Naive K-medoids is not the fastest algorithm, but there are fast variants that are probably good enough for your purposes. Here are descriptions of the algorithms and links to the documentation for their implementations in R:
If you need more information, here's a paper that gives an overview of these and other K-medoids methods.
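If you'd rather experiment without pulling in a library, here is a rough C# sketch of the simple "Voronoi iteration" flavor of K-medoids (not the PAM algorithm those R packages implement). It uses nothing but the question's Thing.DistanceTo, returns plain lists instead of a Cluster type, and its initialization is deliberately naive; the class and method names are mine.

using System.Collections.Generic;
using System.Linq;

static class KMedoidsSketch
{
    // Voronoi-iteration K-medoids: alternate between assigning every item to its
    // nearest medoid and re-picking each cluster's medoid as the member with the
    // smallest total distance to the rest of its cluster.
    public static List<List<Thing>> Cluster(int numClusters, List<Thing> things)
    {
        // Naive initialization: the first numClusters items become the medoids.
        // (A real implementation would pick them at random or greedily.)
        var medoids = things.Take(numClusters).ToList();

        while (true)
        {
            // Assignment step: each item joins the cluster of its nearest medoid.
            var clusters = medoids.Select(_ => new List<Thing>()).ToList();
            foreach (var t in things)
            {
                int best = 0;
                for (int m = 1; m < medoids.Count; m++)
                    if (t.DistanceTo(medoids[m]) < t.DistanceTo(medoids[best]))
                        best = m;
                clusters[best].Add(t);
            }

            // Update step: the new medoid is the member minimizing total distance
            // to the other members of its cluster.
            var newMedoids = clusters
                .Select((c, i) => c.Count == 0
                    ? medoids[i] // keep the old medoid if its cluster went empty
                    : c.OrderBy(cand => c.Sum(other => cand.DistanceTo(other))).First())
                .ToList();

            if (newMedoids.SequenceEqual(medoids))
                return clusters; // converged: the medoids stopped moving
            medoids = newMedoids;
        }
    }
}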
Here's an outline for a clustering algorithm that doesn't have the K-means requirement of finding a centroid.
This algorithm will certainly cluster the objects, but its runtime is O(n^2), and the result depends heavily on which n points are chosen first.
Can anyone improve upon this (better runtime performance, less dependence on the initial choices)? I would love to see your ideas.
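Since the outline itself isn't reproduced above, here is one simple distance-only scheme along those lines, just to have something concrete to improve on (my own sketch, not necessarily the outline intended above; the class name is made up): greedily pick seeds that are far apart, then assign every object to its nearest seed. Seeding costs roughly O(numClusters^2 * n) distance calls and assignment O(numClusters * n), and the outcome depends entirely on the seeds.

using System.Collections.Generic;
using System.Linq;

static class SeedAndAssignSketch
{
    public static List<List<Thing>> Cluster(int numClusters, List<Thing> things)
    {
        // Seed selection: start anywhere, then repeatedly add the item that is
        // farthest from all seeds picked so far (greedy farthest-point choice).
        var seeds = new List<Thing> { things[0] };
        while (seeds.Count < numClusters)
        {
            Thing farthest = things
                .OrderByDescending(t => seeds.Min(s => t.DistanceTo(s)))
                .First();
            seeds.Add(farthest);
        }

        // Assignment: each item joins the cluster of its nearest seed.
        var clusters = seeds.Select(_ => new List<Thing>()).ToList();
        foreach (var t in things)
        {
            int best = 0;
            for (int s = 1; s < seeds.Count; s++)
                if (t.DistanceTo(seeds[s]) < t.DistanceTo(seeds[best]))
                    best = s;
            clusters[best].Add(t);
        }
        return clusters;
    }
}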