 

How to cluster objects (without coordinates)

I have a list of opaque objects. All I can do is calculate the distance between two of them (that's not strictly true, but it sets the conditions for the problem):

class Thing {
    public double DistanceTo(Thing other);
}

I would like to cluster these objects. I would like to control the number of clusters and I would like for "close" objects to be in the same cluster:

List<Cluster> cluster(int numClusters, List<Thing> things);

Can anyone suggest (and link to ;-)) some clustering algorithms (the simpler, the better!) or libraries that can help me?

Clarification Most clustering algorithms require that the objects be laid out in some N-dimensional space. This space is used to find "centroids" of clusters. In my case, I do not know what N is, nor do I know how to extract a coordinate system from the objects. All I know is how far apart 2 objects are. I would like to find a good clustering algorithm that uses only that information.

Imagine that you are clustering based upon the "smell" of an object. You don't know how to lay "smells out" on a 2D plane, but you do know whether two smells are similar or not.
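To make the setting concrete (this sketch is mine, not part of the original question): with only a pairwise distance function and no coordinates, the entire usable input to a clustering algorithm is the pairwise distance matrix. The question's code is C#, but the idea translates; `distance_to` below stands in for the hypothetical `Thing.DistanceTo`:

```python
def distance_matrix(things, distance_to):
    """Build an n x n matrix of pairwise distances.

    `things` is any list of opaque objects; `distance_to(a, b)` stands in
    for the Thing.DistanceTo method from the question. Assumes the
    distance is symmetric and zero for an object to itself.
    """
    n = len(things)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = distance_to(things[i], things[j])
    return d
```

Any algorithm that consumes only this matrix (rather than coordinates) is a candidate for the "smell" scenario.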

Frank Krueger, asked Mar 28 '09


2 Answers

I think you are looking for K-Medoids. It's like K-means in that you specify the number of clusters, K, in advance, but it doesn't require that you have a notion of "averaging" the objects you're clustering like K-means does.

Instead, every cluster has a representative medoid: the member of the cluster whose total distance to the other members is smallest. You could think of it as a version of K-means that finds "medians" instead of "means". All you need is a distance metric to cluster things, and I've used this in some of my own work for exactly the same reasons you cite.

Naive K-medoids is not the fastest algorithm, but there are fast variants that are probably good enough for your purposes. Here are descriptions of the algorithms and links to the documentation for their implementations in R:

  1. PAM is the basic O(n^2) implementation of K-medoids.
  2. CLARA is a much faster, sampled version of PAM. It works by clustering a randomly sampled subset of the objects with PAM and then assigning the remaining objects to clusters based on that subset. You should still be able to get very good clusterings fast with this.

If you need more information, here's a paper that gives an overview of these and other K-medoids methods.
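To illustrate the idea (my own sketch, not taken from the paper or the R packages above): a naive PAM-style K-medoids loop needs nothing but a precomputed distance matrix. Roughly, in Python:

```python
import random

def k_medoids(d, k, iterations=100, seed=0):
    """Minimal PAM-style K-medoids sketch (not the optimized algorithm).

    `d` is a precomputed n x n distance matrix; only distances are used,
    never coordinates or averages of the objects themselves.
    """
    rng = random.Random(seed)
    n = len(d)
    medoids = rng.sample(range(n), k)
    for _ in range(iterations):
        # Assignment step: each object joins its nearest medoid's cluster.
        clusters = [[] for _ in range(k)]
        for i in range(n):
            nearest = min(range(k), key=lambda c: d[i][medoids[c]])
            clusters[nearest].append(i)
        # Update step: the new medoid of each cluster is the member with
        # the smallest total distance to the rest of the cluster.
        new_medoids = [
            min(cluster, key=lambda i: sum(d[i][j] for j in cluster))
            for cluster in clusters
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters
```

Each iteration alternates between assigning objects to their nearest medoid and re-electing each cluster's medoid, stopping when the medoids no longer change. Note that each cluster always contains at least its own medoid, so the update step never sees an empty cluster.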

Todd Gamblin, answered Oct 19 '22


Here's an outline for a clustering algorithm that doesn't have the K-means requirement of finding a centroid.

  1. Determine the distance between every pair of objects. Record the n most mutually separated objects.
    [finds the seeds of our clusters, time O(n^2)]
  2. Make each of these n objects the seed of its own new, distinct cluster.
  3. For every other object:
    [assign objects to clusters, time O(n^2)]
    1. For each cluster:
      1. Calculate the average distance from a cluster to that object by averaging the distance of each object in the cluster to the object.
    2. Assign the object to the closest cluster.

This algorithm will certainly cluster the objects, but its runtime is O(n^2), and its result depends heavily on which n seed objects are chosen first.
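The outline above can be sketched in Python (interpreting "the n most separate objects" as greedy farthest-point seeding, which is one reasonable reading; this assumes at least 2 clusters):

```python
def outline_cluster(d, k):
    """Sketch of the outlined algorithm: seed clusters with mutually
    far-apart objects, then assign the rest by average distance.

    `d` is an n x n distance matrix; `k` is the desired cluster count
    (assumed >= 2).
    """
    n = len(d)
    # Greedy farthest-point seeding: start from the globally farthest
    # pair, then repeatedly add the object farthest from all seeds.
    i, j = max(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda p: d[p[0]][p[1]])
    seeds = [i, j]
    while len(seeds) < k:
        seeds.append(max((x for x in range(n) if x not in seeds),
                         key=lambda x: min(d[x][s] for s in seeds)))
    clusters = [[s] for s in seeds]
    # Assign every remaining object to the cluster with the smallest
    # average distance to its current members.
    for x in range(n):
        if x in seeds:
            continue
        best = min(range(k),
                   key=lambda c: sum(d[x][m] for m in clusters[c]) / len(clusters[c]))
        clusters[best].append(x)
    return clusters
```

Note that the assignments depend on the order in which objects are visited, since each assignment changes a cluster's average, which is part of why the result is sensitive to the setup.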

Can anyone improve upon this (better runtime performance, less dependence on the initial choices)? I would love to see your ideas.

Frank Krueger, answered Oct 19 '22