How Could One Implement the K-Means++ Algorithm?

Tags:

I am having trouble fully understanding the K-Means++ algorithm. I am interested exactly how the first k centroids are picked, namely the initialization as the rest is like in the original K-Means algorithm.

Is the probability function used based on distance or Gaussian?
In the same time the most long distant point (From the other centroids) is picked for a new centroid.

I will appreciate a step by step explanation and an example. The one in Wikipedia is not clear enough. Also a very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.

455

asked Mar 28 '11 23:03

Anton Andreev

1 Answers

Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding

In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).

I've coded the initialization procedure in Python; I don't know if this helps you.

def initialize(X, K):     C = [X[0]]     for k in range(1, K):         D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X])         probs = D2/D2.sum()         cumprobs = probs.cumsum()         r = scipy.rand()         for j,p in enumerate(cumprobs):             if r < p:                 i = j                 break         C.append(X[i])     return C

EDIT with clarification: The output of cumsum gives us boundaries to partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals (because of break). The for loop checks to see which partition r is in.

Example:

probs = [0.1, 0.2, 0.3, 0.4] cumprobs = [0.1, 0.3, 0.6, 1.0] if r < cumprobs[0]:     # this event has probability 0.1     i = 0 elif r < cumprobs[1]:     # this event has probability 0.2     i = 1 elif r < cumprobs[2]:     # this event has probability 0.3     i = 2 elif r < cumprobs[3]:     # this event has probability 0.4     i = 3

124

answered Sep 22 '22 00:09

Steve Tjoa

Related questions
                            
                                Efficient maths algorithm to calculate intersections
                            
                                Average Runtime of Quickselect
                            
                                longest increasing subsequence(O(nlogn))
                            
                                How to keep track of depth in breadth first search?
                            
                                Finding a single number in a list [duplicate]
                            
                                What is the best way to get the minimum or maximum value from an Array of numbers?
                            
                                How to implement a Median-heap
                            
                                Find maximum possible time HH:MM by permuting four given digits
                            
                                How to check if line segment intersects a rectangle?
                            
                                What is the problem name for Traveling salesman problem(TSP) without considering going back to starting point?
                            
                                Empirically estimating big-oh time efficiency
                            
                                How does Google Docs deal with editing collisions?
                            
                                Given an audio stream, find when a door slams (sound pressure level calculation?)
                            
                                Why are hash table expansions usually done by doubling the size?
                            
                                Genetic algorithm resource [closed]
                            
                                Why do we need prefix, postfix notation
                            
                                Good examples, articles, books for understanding dynamic programming [closed]
                            
                                Segmented Sieve of Eratosthenes?
                            
                                Tracing and Returning a Path in Depth First Search
                            
                                How to rotate a table 45 degrees and save the result into another table?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How Could One Implement the K-Means++ Algorithm?

Tags:

language-agnostic

algorithm

machine-learning

cluster-analysis

k-means

Anton Andreev

People also ask

1 Answers

Steve Tjoa

Recent Activity

Donate For Us