Scikit K-means clustering performance measure

Tags:

I'm trying to do a clustering with K-means method but I would like to measure the performance of my clustering. I'm not an expert but I am eager to learn more about clustering.

Here is my code :

import pandas as pd
from sklearn import datasets

#loading the dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data)

#K-Means
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(df) #K-means training
y_pred = k_means.predict(df)

#We store the K-means results in a dataframe
pred = pd.DataFrame(y_pred)
pred.columns = ['Species']

#we merge this dataframe with df
prediction = pd.concat([df,pred], axis = 1)

#We store the clusters
clus0 = prediction.loc[prediction.Species == 0]
clus1 = prediction.loc[prediction.Species == 1]
clus2 = prediction.loc[prediction.Species == 2]
k_list = [clus0.values, clus1.values,clus2.values]

Now that I have my KMeans and my three clusters stored, I'm trying to use the Dunn Index to measure the performance of my clustering (we seek the greater index) For that purpose I import the jqm_cvi package (available here)

from jqmcvi import base
base.dunn(k_list)

My question is : does any clustering internal evaluation already exists in Scikit Learn (except from silhouette_score) ? Or in another well known library ?

Thank you for your time

335

asked May 04 '17 13:05

Viphone Rathikoun

1 Answers

Apart from Silhouette Score, Elbow Criterion can be used to evaluate K-Mean clustering. It is not available as a function/method in Scikit-Learn. We need to calculate SSE to evaluate K-Means clustering using Elbow Criterion.

The idea of the Elbow Criterion method is to choose the k(no of cluster) at which the SSE decreases abruptly. The SSE is defined as the sum of the squared distance between each member of the cluster and its centroid.

Calculate Sum of Squared Error(SSE) for each value of k, where k is no. of cluster and plot the line graph. SSE tends to decrease toward 0 as we increase k (SSE=0, when k is equal to the no. of data points in the dataset, because then each data point is its own cluster, and there is no error between it and the center of its cluster).

So the goal is to choose a small value of k that still has a low SSE, and the elbow usually represents, where we start to have diminishing returns by increasing k.

Iris dataset example:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris['feature_names'])
#print(X)
data = X[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']]

sse = {}
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data)
    data["clusters"] = kmeans.labels_
    #print(data["clusters"])
    sse[k] = kmeans.inertia_ # Inertia: Sum of distances of samples to their closest cluster center
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.show()

enter image description here

If the line graph looks like an arm - a red circle in above line graph (like angle), the "elbow" on the arm is the value of optimal k (number of cluster). According to above elbow in line graph, number of optimal cluster is 3.

Note: Elbow Criterion is heuristic in nature, and may not work for your data set. Follow intuition according to dataset and the problem your are trying to solve.

Hope it helps!

answered Oct 01 '22 23:10

Om Prakash

Related questions
                            
                                Design tips for a program to be run in 25 years [closed]
                            
                                South: run a migration for a column that is both unique and not null
                            
                                how to start django shell with ipython in qtconsole mode?
                            
                                pip: inconsistent permissions issues
                            
                                How to use modern string formatting options with Python's logging module?
                            
                                Why does re.findall() find more matches than re.sub()?
                            
                                sqlalchemy bulk update performance problems
                            
                                python read_fwf error: 'dtype is not supported with python-fwf parser'
                            
                                matplotlib.pyplot vs matplotlib.pylab
                            
                                Is a Scripts directory an anti-pattern in Python? If so, what's the right way to import?
                            
                                How to create a django ViewFlow process programmatically
                            
                                Spark Java Error: Size exceeds Integer.MAX_VALUE
                            
                                PyCharm warns for unresolved reference builtin datetime module
                            
                                How to use botocore.response.StreamingBody as stdin PIPE
                            
                                How to write unit tests for python tornado application?
                            
                                How to create a pydub AudioSegment using an numpy array?
                            
                                Square root of complex numbers in python
                            
                                How to get feature names selected by feature elimination in sklearn pipeline?
                            
                                Why doesn't groupby sum convert boolean to int or float?
                            
                                NaNs when subtracting dataframes pandas

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Scikit K-means clustering performance measure

Tags:

python

machine-learning

cluster-analysis

scikit-learn

sklearn-pandas

Viphone Rathikoun

People also ask

1 Answers

Om Prakash

Recent Activity

Donate For Us