How to choose initial centroids for k-means clustering

Tags: python, cluster-analysis, k-means, data-mining, centroid

I am working on implementing k-means clustering in Python. What is a good way to choose the initial centroids for a data set? For instance, I have the following data set:

A,1,1
B,2,1
C,4,4
D,4,5

I need to create two different clusters. How do I choose the starting centroids?

603

asked Mar 12 '16 00:03

Clint Whaley

1 Answer

You might want to learn about the k-means++ method. It is one of the most popular and easiest ways of choosing initial centroids, and it tends to give more consistent results than picking all of them uniformly at random. There is a paper on it. It works as follows:

1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2 (you can use scipy.stats.rv_discrete for that).
4. Repeat steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
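
The seeding procedure above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the function names `squared_distance` and `kmeans_pp_init` are made up for this example, points are assumed to be tuples of numbers, and the standard library's `random.choices` is used for the weighted draw in place of scipy.stats.rv_discrete so that the snippet has no third-party dependencies.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers from `points` using k-means++ seeding.

    `rng` is an optional random.Random instance (hypothetical parameter,
    added here only to make runs reproducible).
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # Choose the first center uniformly at random.
    centers = [rng.choice(points)]

    # d2[i] holds D(x)^2: squared distance from points[i] to its nearest center.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with an existing center (duplicate data),
            # so the weights are all zero; fall back to a uniform draw.
            new_center = rng.choice(points)
        else:
            # Draw a point with probability proportional to D(x)^2.
            new_center = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new_center)

        # Update D(x)^2, keeping the distance to the nearest center so far.
        d2 = [min(old, squared_distance(p, new_center))
              for old, p in zip(d2, points)]

    return centers


# The data set from the question: A, B, C, D.
points = [(1, 1), (2, 1), (4, 4), (4, 5)]
centers = kmeans_pp_init(points, k=2, rng=random.Random(0))
print(centers)
```

Because already chosen points have D(x)^2 = 0, they are never drawn again, and points far from the first center are strongly favored. For the four points in the question, the second center will therefore most likely come from the other group, after which the returned centers can be passed to the usual assign-and-update k-means loop.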

137

answered Sep 28 '22 06:09

Tony Babarino
