Scikit Learn - K-Means - Elbow - criterion

Tags: python, machine-learning, cluster-analysis, k-means, scikit-learn

Today I'm trying to learn something about K-means. I understand the algorithm and know how it works. Now I'm looking for the right k. I found the elbow criterion as a method to detect the right k, but I don't understand how to use it with scikit-learn. In scikit-learn I'm clustering things this way:

    kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
    kmeans.fit(data)

So should I do this several times for n_clusters = 1...n and watch the error rate to find the right k? I think this would be stupid and would take a lot of time.

asked Oct 05 '13 12:10

Linda

1 Answer

If the true labels are not known in advance (as in your case), K-means clustering can be evaluated using either the elbow criterion or the silhouette coefficient.

Elbow Criterion Method:

The idea behind the elbow method is to run k-means clustering on a given dataset for a range of values of k (n_clusters in scikit-learn, e.g. k = 1 to 10), and for each value of k calculate the sum of squared errors (SSE).

After that, plot a line graph of the SSE for each value of k. If the line graph looks like an arm, the "elbow" of the arm (circled in red in the plot below) is the optimal value of k. We want a low SSE, but SSE tends to decrease toward 0 as we increase k (and SSE is 0 when k equals the number of data points in the dataset, because then each data point is its own cluster and there is no error between it and the center of its cluster).

So the goal is to choose a small value of k that still has a low SSE; the elbow usually marks the point where increasing k starts to give diminishing returns.
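To make the SSE concrete: scikit-learn exposes it on a fitted model as `inertia_`. The following is a minimal sketch on a made-up six-point dataset (the points, the helper `manual_sse`, and the variable names are illustrative assumptions, not part of the question). It recomputes the SSE by hand for every k and shows that it reaches 0 when k equals the number of points.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data, invented for illustration: two loose groups of three points.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])

def manual_sse(model, X):
    """Sum of squared distances from each sample to its assigned center."""
    centers = model.cluster_centers_[model.labels_]
    return float(((X - centers) ** 2).sum())

inertia_by_k = {}
manual_by_k = {}
for k in range(1, len(points) + 1):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertia_by_k[k] = model.inertia_
    manual_by_k[k] = manual_sse(model, points)

# The two columns should agree up to rounding, and the last row should be ~0.
for k in inertia_by_k:
    print(k, round(inertia_by_k[k], 3), round(manual_by_k[k], 3))
```

The sharp drop between k=1 and k=2, followed by only small improvements, is the "elbow" shape described above.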

Let's consider the iris dataset:

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    iris = load_iris()
    X = pd.DataFrame(iris.data, columns=iris['feature_names'])
    # Cluster on three of the four features.
    data = X[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']]

    sse = {}
    for k in range(1, 10):
        # The labels (kmeans.labels_) are deliberately not written back into
        # `data`: an extra "clusters" column would be treated as a feature
        # by the next fit.
        kmeans = KMeans(n_clusters=k, n_init=10, max_iter=1000).fit(data)
        # inertia_: sum of squared distances of samples to their closest cluster center
        sse[k] = kmeans.inertia_

    plt.figure()
    plt.plot(list(sse.keys()), list(sse.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("SSE")
    plt.show()

Plot for the above code: https://i.stack.imgur.com/BHYK4.png

In the plot, the elbow (circled in red) is at k = 3, which matches the three species in the iris dataset.
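Reading the elbow off a plot is a judgment call. As a rough aid, one common heuristic (not something scikit-learn provides, and not the only possible definition of "elbow") is to pick the point on the curve that lies farthest from the straight line joining its first and last points. A minimal sketch, where `find_elbow` is a hypothetical helper and the SSE values are made up for illustration; it assumes at least two k values and an SSE that ends lower than it starts:

```python
def find_elbow(ks, sse):
    """Return the k whose (k, SSE) point lies farthest below the chord
    joining the first and last points of the curve."""
    # Rescale both axes to [0, 1] so neither dominates the distance.
    k_span = ks[-1] - ks[0]
    sse_span = sse[0] - sse[-1]
    best_k, best_gap = ks[0], 0.0
    for k, value in zip(ks, sse):
        x = (k - ks[0]) / k_span
        y = (value - sse[-1]) / sse_span
        # After rescaling, the chord runs from (0, 1) to (1, 0), i.e. x + y = 1.
        # For points below the chord, the distance to it is proportional to 1 - x - y.
        gap = 1.0 - x - y
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k

# Made-up SSE values with a sharp bend at k=3.
example_ks = [1, 2, 3, 4, 5, 6]
example_sse = [100.0, 40.0, 10.0, 8.0, 7.0, 6.0]
print(find_elbow(example_ks, example_sse))  # prints 3
```

With the `sse` dictionary from the iris code, this could be called as `find_elbow(list(sse.keys()), list(sse.values()))`. Treat the result as a suggestion to check against the plot, not as a definitive answer.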

Silhouette Coefficient Method:

From the scikit-learn documentation:

A higher Silhouette Coefficient score relates to a model with better-defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:

a: The mean distance between a sample and all other points in the same class.

b: The mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient for a single sample is then given as:

$s = \frac{b - a}{\max(a, b)}$
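To see how a and b combine, here is a minimal NumPy sketch that evaluates the formula for every sample. The function name `silhouette_per_sample` and the four-point toy dataset are assumptions made for illustration. The sketch assumes at least two clusters and assigns 0 to a sample that is alone in its cluster; in practice `sklearn.metrics.silhouette_samples` computes the same per-sample quantity.

```python
import numpy as np

def silhouette_per_sample(X, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for every sample."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any():
            continue  # singleton cluster: score left at 0
        # a: mean distance to the other points in the same cluster
        a = dist[i, same].mean()
        # b: mean distance to the points of the nearest other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy 1-D data: two tight pairs that are far apart.
toy_X = [[0.0], [1.0], [10.0], [11.0]]
toy_labels = [0, 0, 1, 1]
toy_scores = silhouette_per_sample(toy_X, toy_labels)
print(toy_scores)
```

For the first point, a = 1 (distance to its only cluster mate) and b = 10.5 (mean distance to 10 and 11), so s = 9.5 / 10.5, close to 1, as expected for well-separated clusters.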

Now, to find the optimal value of k for KMeans, loop over n_clusters = 2..n (the Silhouette Coefficient needs at least two clusters) and calculate the mean Silhouette Coefficient over all samples for each value.

A higher Silhouette Coefficient indicates that the samples are well matched to their own cluster and poorly matched to neighboring clusters.

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris
    from sklearn.metrics import silhouette_score

    X = load_iris().data

    for n_cluster in range(2, 11):
        kmeans = KMeans(n_clusters=n_cluster).fit(X)
        label = kmeans.labels_
        # Mean silhouette coefficient over all samples
        sil_coeff = silhouette_score(X, label, metric='euclidean')
        print("For n_clusters={}, The Silhouette Coefficient is {}".format(n_cluster, sil_coeff))

Output -

For n_clusters=2, The Silhouette Coefficient is 0.680813620271
For n_clusters=3, The Silhouette Coefficient is 0.552591944521
For n_clusters=4, The Silhouette Coefficient is 0.496992849949
For n_clusters=5, The Silhouette Coefficient is 0.488517550854
For n_clusters=6, The Silhouette Coefficient is 0.370380309351
For n_clusters=7, The Silhouette Coefficient is 0.356303270516
For n_clusters=8, The Silhouette Coefficient is 0.365164535737
For n_clusters=9, The Silhouette Coefficient is 0.346583642095
For n_clusters=10, The Silhouette Coefficient is 0.328266088778

As we can see, n_clusters=2 has the highest Silhouette Coefficient. This means that 2 should be the optimal number of clusters, right?

But here's the catch.

The iris dataset has 3 species of flower, which contradicts 2 as the optimal number of clusters. So despite n_clusters=2 having the highest Silhouette Coefficient, we would consider n_clusters=3 the optimal number of clusters because:

1. The iris dataset has 3 species. (Most important)
2. n_clusters=3 has the 2nd highest Silhouette Coefficient.

So n_clusters=3 is the optimal number of clusters for the iris dataset.

Choosing the optimal number of clusters depends on the type of dataset and the problem we are trying to solve. But in most cases, taking the highest Silhouette Coefficient will yield an optimal number of clusters.
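Since the two criteria can disagree, it may help to compute both in one pass and look at them side by side. A minimal sketch, assuming the iris data and a fixed `random_state` for repeatability (the `results` dictionary and `best_by_silhouette` name are illustrative choices, not part of scikit-learn):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = {
        "sse": model.inertia_,  # used by the elbow criterion
        "silhouette": silhouette_score(X, model.labels_, metric="euclidean"),
    }

for k, row in results.items():
    print("k={}: SSE={:.2f}, silhouette={:.3f}".format(k, row["sse"], row["silhouette"]))

# k with the highest mean silhouette; compare it against the elbow in the SSE column.
best_by_silhouette = max(results, key=lambda k: results[k]["silhouette"])
print("Highest silhouette at k =", best_by_silhouette)
```

Each k is fitted only once here, so checking both criteria costs no more than checking one.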

Hope it helps!

answered Oct 02 '22 12:10

Om Prakash
