Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How Could One Implement the K-Means++ Algorithm?

I am having trouble fully understanding the K-Means++ algorithm. I am interested exactly how the first k centroids are picked, namely the initialization as the rest is like in the original K-Means algorithm.

  1. Is the probability function used based on distance or Gaussian?
  2. In the same time the most long distant point (From the other centroids) is picked for a new centroid.

I will appreciate a step by step explanation and an example. The one in Wikipedia is not clear enough. Also a very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.

like image 455
Anton Andreev Avatar asked Mar 28 '11 23:03

Anton Andreev


People also ask

How implement K-means algorithm in Python?

Step-1: Select the value of K, to decide the number of clusters to be formed. Step-2: Select random K points which will act as centroids. Step-3: Assign each data point, based on their distance from the randomly selected points (Centroid), to the nearest/closest centroid which will form the predefined clusters.

How K-means algorithm works with example?

K-means Clustering Method:Partition of objects into k non-empty subsets. Identifying the cluster centroids (mean point) of the current partition. Assigning each point to a specific cluster. Compute the distances from each point and allot points to the cluster where the distance from the centroid is minimum.


1 Answers

Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding

In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).

I've coded the initialization procedure in Python; I don't know if this helps you.

def initialize(X, K):     C = [X[0]]     for k in range(1, K):         D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X])         probs = D2/D2.sum()         cumprobs = probs.cumsum()         r = scipy.rand()         for j,p in enumerate(cumprobs):             if r < p:                 i = j                 break         C.append(X[i])     return C 

EDIT with clarification: The output of cumsum gives us boundaries to partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals (because of break). The for loop checks to see which partition r is in.

Example:

probs = [0.1, 0.2, 0.3, 0.4] cumprobs = [0.1, 0.3, 0.6, 1.0] if r < cumprobs[0]:     # this event has probability 0.1     i = 0 elif r < cumprobs[1]:     # this event has probability 0.2     i = 1 elif r < cumprobs[2]:     # this event has probability 0.3     i = 2 elif r < cumprobs[3]:     # this event has probability 0.4     i = 3 
like image 124
Steve Tjoa Avatar answered Sep 22 '22 00:09

Steve Tjoa