Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

K-means++ algorithm

Tags:

algorithm

I read the paper k-means++: The Advantages of Careful Seeding and didn't quite understand the algorithm provided which is:

"Let D(x) denote the shortest distance from a data point x to the closest center we have already chosen.

1a. Choose an initial center c1 uniformly at random from X.

1b. Choose the next center ci, selecting ci = x' ∈ X with probability (D(x')^2) / Sum_of(D(x)^2)

1c. Repeat Step 1b until we have chosen a total of k centers.

2-4. Proceed as with the standard k-means algorithm"

(Better look at the algorithm in the link above)

Especially step 1b. What do they mean by "selecting ci = x' ∈ X with probability (D(x')^2) / Sumof(D(x)^2)". Do they mean selecting the element that has the largest proportion ? And how can perform such computation can lead to select the optimal centroids ?

like image 438
Long Phan Avatar asked Jul 05 '13 01:07

Long Phan


1 Answers

The function D(x) is defined for all points x ∈ X.

In step 1b, we must choose some point to be a new center. We will choose randomly between all points (that are not already centers). But we will not give an equal chance to every point; we will assign different probabilities to different points before we choose. These probabilities must add up to 1.

Consider D(x)^2. We can evaluate this at every point, and add up the values: Sum_of(D(x)^2).

Then we can assign each point x' a probability equal to D(x')^2 / Sum_of(D(x)^2). These probabilities add up to 1, and give a better chance to points far away from all existing centers.

like image 155
Beta Avatar answered Oct 07 '22 21:10

Beta