I read the paper k-means++: The Advantages of Careful Seeding and didn't quite understand the algorithm provided, which is:
"Let D(x) denote the shortest distance from a data point x to the closest center we have already chosen.
1a. Choose an initial center c1 uniformly at random from X.
1b. Choose the next center ci, selecting ci = x' ∈ X with probability (D(x')^2) / Sum_of(D(x)^2)
1c. Repeat Step 1b until we have chosen a total of k centers.
2-4. Proceed as with the standard k-means algorithm"
(It's best to look at the algorithm in the paper linked above.)
Especially step 1b. What do they mean by "selecting ci = x' ∈ X with probability D(x')^2 / Sum_of(D(x)^2)"? Do they mean selecting the element that has the largest proportion? And how can performing such a computation lead to selecting the optimal centroids?
The function D(x) is defined for all points x ∈ X.
In step 1b, we must choose some point to be a new center. We will choose randomly among all points (that are not already centers). But we will not give an equal chance to every point; we will assign different probabilities to different points before we choose. These probabilities must add up to 1.
Consider D(x)^2. We can evaluate this at every point, and add up the values: Sum_of(D(x)^2).
Then we can assign each point x' a probability equal to D(x')^2 / Sum_of(D(x)^2). These probabilities add up to 1, and give a better chance to points far away from all existing centers.
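To make this concrete, here is a minimal sketch of the seeding procedure in Python. The function name `kmeans_pp_seed` and the 2-D point representation are my own choices for illustration; it implements exactly the sampling described above: compute D(x)^2 for every point, and pick the next center with probability proportional to it.

```python
import random

def kmeans_pp_seed(points, k, rng=random):
    """Pick k initial centers from `points` (a list of (x, y) tuples)
    using the k-means++ D(x)^2-weighted sampling."""
    # Step 1a: choose the first center uniformly at random.
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        # D(x)^2 for every point: squared distance to the nearest
        # already-chosen center. (Existing centers get D(x)^2 = 0,
        # so they are effectively never chosen again.)
        d2 = [min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
              for px, py in points]
        total = sum(d2)
        # Step 1b: sample one point with probability D(x')^2 / Sum_of(D(x)^2).
        # Drawing r in [0, total) and walking the cumulative sum is
        # equivalent to sampling from those normalized probabilities.
        r = rng.random() * total
        cumulative = 0.0
        for point, weight in zip(points, d2):
            cumulative += weight
            if cumulative > r:
                centers.append(point)
                break
    return centers

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = kmeans_pp_seed(pts, 2)
```

Note that no point is guaranteed to be picked, not even the farthest one; far-away points simply occupy a larger share of the [0, total) interval, which is why the seeding tends to spread centers out without being as brittle as always taking the maximum.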