 

Most mutually distant k elements (clustering?)

I have a simple machine learning question:

I have n (~110) elements and a matrix of all the pairwise distances. I would like to choose the 10 elements that are farthest apart. That is, I want to

Maximize:
  Choose 10 different elements.
  Return min distance over (all pairings within the 10).

My distance metric is symmetric and respects the triangle inequality.
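Concretely, the objective above could be scored like this (a sketch of my own, not part of the original question; D is the full symmetric distance matrix):

```python
import numpy as np
from itertools import combinations

def subset_score(D, subset):
    """The quantity to maximize: the minimum pairwise distance
    within the chosen subset of indices."""
    return min(D[i, j] for i, j in combinations(subset, 2))
```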

What kind of algorithm can I use? My first instinct is to do the following:

  1. Cluster the n elements into 20 clusters.
  2. Replace each cluster with just the element of that cluster that is furthest from the mean element of the original n.
  3. Use brute force to solve the problem on the remaining 20 candidates. Luckily, 20 choose 10 is only 184,756.
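The brute-force step (3) might look like this sketch (my own illustration, not the asker's code; it exhaustively scores every size-k subset of the surviving candidates):

```python
import itertools
import numpy as np

def best_subset(D, candidates, k):
    """Exhaustively score every size-k subset of the candidate indices
    and return the one maximizing the minimum pairwise distance."""
    best_val, best_sub = -np.inf, None
    for sub in itertools.combinations(candidates, k):
        idx = list(sub)
        # min over all pairs within the subset (strict upper triangle)
        val = D[np.ix_(idx, idx)][np.triu_indices(k, 1)].min()
        if val > best_val:
            best_val, best_sub = val, sub
    return best_sub, best_val
```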

Edit: thanks to etarion's insightful comment, changed "Return sum of (distances)" to "Return min distance" in the optimization problem statement.

AlcubierreDrive asked Mar 23 '11


2 Answers

Here's how you might approach this combinatorial optimization problem by taking the convex relaxation.

Let D be an upper triangular matrix with your distances on the upper triangle. That is, for i < j, D_i,j is the distance between elements i and j. (Presumably, you'll have zeros on the diagonal as well.)

Then your objective is to maximize x'*D*x, where x is binary valued with 10 elements set to 1 and the rest to 0. (Setting the ith entry in x to 1 is analogous to selecting the ith element as one of your 10 elements.)

The "standard" convex optimization thing to do with a combinatorial problem like this is to relax the constraints such that x need not be discrete valued. Doing so gives us the following problem:

maximize y'*D*y subject to: 0 <= y_i <= 1 for all i, 1'*y = 10

This is (morally) a quadratic program. (If we replace D with D + D', it'll become a bona fide quadratic program and the y you get out should be no different.) You can use an off-the-shelf QP solver, or just plug it in to the convex optimization solver of your choice (e.g. cvx).

The y you get out need not be (and probably won't be) a binary vector, but you can convert the scalar values to discrete ones in a bunch of ways. (The simplest is probably to let x be 1 in the 10 entries where y_i is highest, but you might need to do something a little more complicated.) In any case, y'*D*y with the y you get out does give you an upper bound for the optimal value of x'*D*x, so if the x you construct from y has x'*D*x very close to y'*D*y, you can be pretty happy with your approximation.
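Not from the answer itself, but here's one way the relaxed problem might be set up numerically (a sketch assuming scipy; since maximizing y'*D*y with nonnegative D is generally non-concave, a general-purpose solver like SLSQP only returns a local solution, so treat the result as a heuristic):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, k = 30, 5                       # small stand-in for n=110, k=10
pts = rng.random((n, 2))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
D = np.triu(dist)                  # upper-triangular, zero diagonal

# Relaxed problem: maximize y'*D*y  s.t.  0 <= y_i <= 1,  sum(y) = k.
# SLSQP minimizes, so negate; the gradient symmetrizes D as D + D'.
obj = lambda y: -y @ D @ y
grad = lambda y: -(D + D.T) @ y
cons = [{"type": "eq", "fun": lambda y: y.sum() - k,
         "jac": lambda y: np.ones(n)}]
res = minimize(obj, np.full(n, k / n), jac=grad,
               bounds=[(0.0, 1.0)] * n, constraints=cons, method="SLSQP")

# Simplest rounding: take the k entries with the largest y.
chosen = np.argsort(res.x)[-k:]
```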

Let me know if any of this is unclear, notation or otherwise.

Dan Robinson answered Oct 19 '22


Nice question.

I'm not sure if it can be solved exactly in an efficient manner, and your clustering-based solution seems reasonable. Another direction to look at would be local search methods such as simulated annealing and hill climbing.

Here's an obvious baseline I would compare any other solution against:

  1. Repeat 100 times (i.e. until only 10 elements remain):
  2. Greedily remove the datapoint whose removal decreases the objective function the least.
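The greedy baseline above could be sketched as follows (my own code, not the answerer's; D is the full symmetric distance matrix and the objective is the min pairwise distance among survivors):

```python
import numpy as np

def greedy_prune(D, k):
    """Baseline heuristic: repeatedly drop the point whose removal
    decreases the objective (minimum pairwise distance among the
    survivors) the least, until only k points remain.
    Not guaranteed optimal."""
    remaining = list(range(len(D)))
    while len(remaining) > k:
        best_score, best_i = -np.inf, None
        for i in remaining:
            rest = [j for j in remaining if j != i]
            sub = D[np.ix_(rest, rest)]
            # objective after removing i: min over strict upper triangle
            score = sub[np.triu_indices(len(rest), 1)].min()
            if score > best_score:
                best_score, best_i = score, i
        remaining.remove(best_i)
    return remaining
```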
carlosdc answered Oct 19 '22