I am having trouble fully understanding the K-Means++ algorithm. I am interested exactly how the first k
centroids are picked, namely the initialization as the rest is like in the original K-Means algorithm.
I will appreciate a step by step explanation and an example. The one in Wikipedia is not clear enough. Also a very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.
Step-1: Select the value of K, to decide the number of clusters to be formed. Step-2: Select random K points which will act as centroids. Step-3: Assign each data point, based on their distance from the randomly selected points (Centroid), to the nearest/closest centroid which will form the predefined clusters.
K-means Clustering Method:Partition of objects into k non-empty subsets. Identifying the cluster centroids (mean point) of the current partition. Assigning each point to a specific cluster. Compute the distances from each point and allot points to the cluster where the distance from the centroid is minimum.
Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding
In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x
is high if x
is not near any previously chosen centers.
Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1
, be 0. The probability that the next cluster center, c2
, is x is proportional to ||c1-x||^2
. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).
Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
I've coded the initialization procedure in Python; I don't know if this helps you.
def initialize(X, K): C = [X[0]] for k in range(1, K): D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X]) probs = D2/D2.sum() cumprobs = probs.cumsum() r = scipy.rand() for j,p in enumerate(cumprobs): if r < p: i = j break C.append(X[i]) return C
EDIT with clarification: The output of cumsum
gives us boundaries to partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r
is uniformly chosen between [0,1], it will fall into exactly one of these intervals (because of break
). The for
loop checks to see which partition r
is in.
Example:
probs = [0.1, 0.2, 0.3, 0.4] cumprobs = [0.1, 0.3, 0.6, 1.0] if r < cumprobs[0]: # this event has probability 0.1 i = 0 elif r < cumprobs[1]: # this event has probability 0.2 i = 1 elif r < cumprobs[2]: # this event has probability 0.3 i = 2 elif r < cumprobs[3]: # this event has probability 0.4 i = 3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With