
Sklearn kmeans equivalent of elbow method

Let's say I'm examining up to 10 clusters, with scipy I usually generate the 'elbow' plot as follows:

from scipy import cluster
from matplotlib import pyplot

cluster_array = [cluster.vq.kmeans(my_matrix, i) for i in range(1, 10)]

pyplot.plot([var for (cent, var) in cluster_array])
pyplot.show()

I have since become motivated to use sklearn for clustering; however, I'm not sure how to create the array needed for the plot, as in the scipy case. My best guess was:

from sklearn.cluster import KMeans

km = [KMeans(n_clusters=i) for i range(1,10)]
cluster_array = [km[i].fit(my_matrix)]

That unfortunately resulted in an invalid command error. What is the best sklearn way to go about this?

Thank you

Arash Howaida asked Jan 09 '17


People also ask

What is the elbow method in KMeans?

The elbow method runs k-means clustering on the dataset for a range of values of k (say from 1 to 10) and then, for each value of k, computes an average score for all clusters. By default the distortion score is computed: the sum of squared distances from each point to its assigned center.
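To make the definition above concrete, here is a minimal sketch showing that this distortion score is exactly what sklearn exposes as the fitted model's inertia_ attribute. The make_blobs dataset and all parameter values are assumptions for illustration, not part of the thread:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data just for the demonstration.
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Distortion: for each point, squared distance to its nearest center, summed.
sq_dists = np.min(cdist(X, kmeans.cluster_centers_, 'sqeuclidean'), axis=1)
distortion = sq_dists.sum()

print(np.isclose(distortion, kmeans.inertia_))  # True
```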

How do you implement elbow method in Python?

To find the optimal number of clusters, the elbow method follows these steps:

1. Run K-means clustering on the given dataset for different values of K (e.g. ranging from 1 to 10).
2. For each value of K, calculate the WCSS value.
3. Plot a curve of the WCSS values against the respective number of clusters K.

How do you find the clusters from the elbow method?

The number of clusters is where the elbow bends. The x axis of the plot is the number of clusters and the y axis is the Within-Cluster Sum of Squares (WCSS) for each number of clusters:

wcss = []
for i in range(1, 11):
    clustering = KMeans(n_clusters=i, init='k-means++', random_state=42)
    clustering.fit(df)
    wcss.append(clustering.inertia_)

Why is elbow method required in clustering?

In cluster analysis, the elbow method is a heuristic for determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.
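One common way to quantify "explained variation" is the fraction 1 - WCSS/TSS, where TSS is the total sum of squares around the global mean. A hedged sketch on synthetic data (make_blobs and all parameters are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Total sum of squares: squared distances to the overall mean.
tss = ((X - X.mean(axis=0)) ** 2).sum()

explained = []
for k in range(1, 10):
    wcss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    explained.append(1 - wcss / tss)  # fraction of variance explained by the clustering
```

With k=1 the single centroid is the global mean, so the explained fraction starts at 0 and rises toward 1 as k grows; the elbow is where the gains flatten out.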


3 Answers

You can use the inertia_ attribute of the KMeans class.

Assuming X is your dataset:

from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

X = # <your_data>
distortions = []
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    distortions.append(kmeans.inertia_)

fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, 20), distortions)
plt.grid(True)
plt.title('Elbow curve')
plt.show()

Ahmed Besbes answered Oct 21 '22


You had some syntax problems in the code. They should be fixed now:

Ks = range(1, 10)
km = [KMeans(n_clusters=i) for i in Ks]
score = [km[i].fit(my_matrix).score(my_matrix) for i in range(len(km))]

The fit method returns the estimator itself (self). So in this line from the original code

cluster_array = [km[i].fit(my_matrix)]

the cluster_array would end up having the same contents as km.

You can use the score method to get an estimate of how well the clustering fits. To see the score for each number of clusters, simply run plot(Ks, score).
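Note that for KMeans, score(X) is the negative of the inertia on that data, so the resulting curve is an upside-down elbow plot. A small sketch of the answer's approach, using make_blobs as a stand-in for the question's my_matrix (an assumption for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for my_matrix from the question.
my_matrix, _ = make_blobs(n_samples=150, centers=3, random_state=1)

Ks = range(1, 10)
km = [KMeans(n_clusters=i, n_init=10, random_state=1) for i in Ks]
score = [km[i].fit(my_matrix).score(my_matrix) for i in range(len(km))]

# On the training data, score is just the negative inertia.
print(np.isclose(score[2], -km[2].inertia_))  # True
```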

J. P. Petersen answered Oct 21 '22


You can also use the mean Euclidean distance between each data point and its nearest cluster center to evaluate how many clusters to choose. Here is a code example.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

iris = load_iris()
x = iris.data

res = list()
n_cluster = range(2,20)
for n in n_cluster:
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(x)
    res.append(np.average(np.min(cdist(x, kmeans.cluster_centers_, 'euclidean'), axis=1)))

plt.plot(n_cluster, res)
plt.title('elbow curve')
plt.show()

lugq answered Oct 21 '22