I am trying to run k-means clustering with selected initial centroids. It says here that to specify your initial centers:
init : {‘k-means++’, ‘random’ or an ndarray}
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
My code in Python:
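import numpy as np
from sklearn.cluster import KMeans

# Initial centers, one row per cluster: shape (n_clusters, n_features)
X = np.array([[-19.07480000, -8.536],
              [22.010800000, -10.9737],
              [12.659700000, 19.2601]], np.float64)
# `data` is the array of samples to cluster (not shown in the question)
km = KMeans(n_clusters=3, init=X).fit(data)
centers = km.cluster_centers_
print(centers)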
Returns an error:
RuntimeWarning: Explicit initial center position passed: performing only one init in k-means instead of n_init=10
n_jobs=self.n_jobs)
and returns the same initial centers. Any idea how to form the initial centers so they can be accepted?
k-Means [1] is one of the most important algorithms for clustering. The traditional k-means algorithm selects its initial centroids randomly, and the clustering result depends heavily on that selection.
To specify the initial centroids, you just need to pass your array of centroids as the value of the init parameter. Example:
from sklearn.cluster import KMeans
import numpy as np
my_centroids = np.array([[1, 2, 3, 4, 5], [2, 4, 6, 5, 3], [1, 2, 5, 7, 1]])
kmeans = KMeans(n_clusters=3, random_state=0, init=my_centroids)
Specifically, K-means tends to perform better when centroids are seeded in such a way that doesn't clump them together in space. In short, the method is as follows: Choose one of your data points at random as an initial centroid. Calculate D(x), the distance between your initial centroid and all other data points, x.
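The description above stops after the first distance computation. As a rough sketch of how this kind of seeding usually continues in k-means++ (the continuation is the commonly published procedure, not something the quoted passage states): each further centroid is drawn from the data with probability proportional to D(x)^2, where D(x) is the distance from x to the nearest centroid chosen so far. The function name kmeans_pp_seed and the toy data are made up for illustration.

```python
import numpy as np


def kmeans_pp_seed(X, k, rng):
    """Pick k rows of X as initial centroids, k-means++ style (illustrative sketch).

    Assumes X has at least two distinct rows, so the distances never all become zero.
    """
    n_samples = X.shape[0]
    # Step 1: choose one data point uniformly at random as the first centroid.
    centers = [X[rng.integers(n_samples)]]
    # Squared distance from every point to its nearest chosen centroid so far.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        # Points far from all current centroids are more likely to be picked.
        probs = d2 / d2.sum()
        idx = rng.choice(n_samples, p=probs)
        centers.append(X[idx])
        # Update each point's distance to its nearest centroid.
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)


# Hypothetical toy data: 200 random 2-D points.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
seeds = kmeans_pp_seed(X, k=3, rng=rng)
print(seeds)
```

The resulting array has shape (n_clusters, n_features), so it could be passed as init= to KMeans. In practice init='k-means++' already does this kind of seeding inside scikit-learn.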
Also, a form of hierarchical clustering (often Ward's method) can be used to find the initial cluster centers, which can then be passed to k-means for the actual data clustering task. This can be effective, but since it would mean also discussing hierarchical clustering, we will leave this until a later article.
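One way this could look with scikit-learn, as a minimal sketch rather than a recommended recipe: run AgglomerativeClustering with Ward linkage, take the mean of each resulting group as a centroid, and pass those means to KMeans as init. The toy data and the variable names (true_centers, ward_centers) are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Hypothetical toy data: three Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(40, 2)) for c in true_centers])

k = 3
# Step 1: hierarchical clustering with Ward linkage gives a first grouping.
ward_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)

# Step 2: the mean of each Ward group becomes an initial centroid.
ward_centers = np.array([X[ward_labels == j].mean(axis=0) for j in range(k)])

# Step 3: k-means refines those centroids; n_init=1 because the start is fixed.
km = KMeans(n_clusters=k, init=ward_centers, n_init=1).fit(X)
print(km.cluster_centers_)
```

Ward clustering scales poorly with the number of samples, so for large datasets it may be more practical to run it on a random subsample and use those centers as the starting point.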
Random data points: in this approach, described in the "traditional" case above, k random data points are selected from the dataset and used as the initial centroids. The result is highly volatile, and the selected centroids may end up poorly spread across the data space.
One (the “Forgy” method) is to randomly select k data points to be the centers of the k clusters; the other (the “Random Partition” method) randomly assigns each observation to one of k clusters. You then refine by alternating two steps: update cluster membership and then cluster centers, or cluster centers and then membership.
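The two initializations can be sketched in NumPy as follows. This is an illustration only; the function names forgy_init and random_partition_init and the toy data are invented for the example.

```python
import numpy as np


def forgy_init(X, k, rng):
    """Forgy: pick k distinct data points and use them as the initial centers."""
    idx = rng.choice(X.shape[0], size=k, replace=False)
    return X[idx]


def random_partition_init(X, k, rng):
    """Random Partition: assign each point to a random cluster, then average."""
    labels = rng.integers(k, size=X.shape[0])
    # With few points a cluster can come up empty; this sketch simply refuses.
    if np.bincount(labels, minlength=k).min() == 0:
        raise ValueError("a cluster received no points; try another seed")
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])


# Hypothetical toy data: 200 random 2-D points.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

forgy_centers = forgy_init(X, 3, rng)
partition_centers = random_partition_init(X, 3, rng)
print(forgy_centers)
print(partition_centers)
```

Forgy centers are actual data points and tend to be spread out, while Random Partition centers are averages of random subsets and so tend to start close to the overall mean of the data.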
The default behavior of KMeans is to initialize the algorithm multiple times using different centroid seeds (k-means++ seeding by default, or randomly chosen data points, i.e. the Forgy method, with init='random'). The number of initializations is then controlled by the n_init= parameter (docs):
n_init : int, default: 10
Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
If you pass an array as the init= argument then only a single initialization will be performed using the centroids explicitly specified in the array. You are getting a RuntimeWarning because you are still passing the default value of n_init=10 (here are the relevant lines of source code).
It's actually totally fine to ignore this warning, but you can make it go away completely by passing n_init=1 if your init= parameter is an array.
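A minimal sketch of that fix, using the initial centers from the question. The original data array is not shown, so the points below are synthetic blobs generated around those centers purely for illustration; the warning-recording code is only there to check that no RuntimeWarning is raised.

```python
import warnings

import numpy as np
from sklearn.cluster import KMeans

# Initial centers from the question, shape (n_clusters, n_features).
init_centers = np.array([[-19.0748, -8.536],
                         [22.0108, -10.9737],
                         [12.6597, 19.2601]], dtype=np.float64)

# Hypothetical stand-in for `data`: 50 noisy points around each center.
rng = np.random.default_rng(0)
data = np.vstack([c + rng.normal(scale=2.0, size=(50, 2)) for c in init_centers])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # n_init=1 matches the single run that an explicit init array implies.
    km = KMeans(n_clusters=3, init=init_centers, n_init=1).fit(data)

runtime_warnings = [w for w in caught if issubclass(w.category, RuntimeWarning)]
print(len(runtime_warnings))
print(km.cluster_centers_)
```

The fitted cluster_centers_ are the centers after k-means has iterated from the supplied starting points, so they will generally differ from init_centers unless those were already at the cluster means.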