
How to choose initial centroids for k-means clustering

I am implementing k-means clustering in Python. What is a good way to choose the initial centroids for a data set? For instance, I have the following data set:

A,1,1
B,2,1
C,4,4
D,4,5

I need to create two different clusters. How do I pick the starting centroids?

Clint Whaley asked Mar 12 '16 00:03



1 Answer

You might want to look at the k-means++ method: it is one of the most popular ways of choosing initial centroids, it is easy to implement, and it gives consistent results. There is a paper on it. It works as follows:

  1. Choose one center uniformly at random from among the data points.
  2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
  3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2 (you can use scipy.stats.rv_discrete for that).
  4. Repeat steps 2 and 3 until k centers have been chosen.
  5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
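
The steps above can be sketched in plain Python roughly as follows. This is only an illustration under a few assumptions: the helper names (squared_distance, kmeans_pp_init, kmeans) are made up for this example, the standard library's random.choices stands in for scipy.stats.rv_discrete as the weighted sampler, and the data set is assumed to contain at least k distinct points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers from `points` with k-means++ seeding.

    Assumes `points` holds at least k distinct points; otherwise every
    weight can become zero and random.choices raises ValueError.
    """
    rng = rng or random.Random()
    # Step 1: the first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Step 2: D(x)^2, the squared distance to the nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        # Step 3: sample the next center with probability proportional to D(x)^2.
        # Points already chosen have weight 0, so they cannot be drawn again.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, centers, iterations=10):
    """Step 5: a fixed number of standard k-means (Lloyd) iterations."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster;
        # an empty cluster keeps its previous center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# The four points A, B, C, D from the question, split into two clusters.
points = [(1, 1), (2, 1), (4, 4), (4, 5)]
initial = kmeans_pp_init(points, k=2)
centers, clusters = kmeans(points, initial)
print("initial centers:", initial)
print("final centers:", centers)
print("clusters:", clusters)
```

On a data set this small the seeding matters little: whichever two points are drawn, the iterations should settle on A and B in one cluster and C and D in the other. The benefit of k-means++ shows up on larger data sets, where a poor random start can leave two centers inside the same natural group.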
Tony Babarino answered Sep 28 '22 06:09