Let's say I'm examining up to 10 clusters, with scipy I usually generate the 'elbow' plot as follows:
from scipy import cluster
cluster_array = [cluster.vq.kmeans(my_matrix, i) for i in range(1,10)]
pyplot.plot([var for (cent,var) in cluster_array])
pyplot.show()
I have since became motivated to use sklearn for clustering, however I'm not sure how to create the array needed to plot as in the scipy case. My best guess was:
from sklearn.cluster import KMeans
km = [KMeans(n_clusters=i) for i range(1,10)]
cluster_array = [km[i].fit(my_matrix)]
That unfortunately resulted in an invalid command error. What is the best way sklearn way to go about this?
Thank you
The elbow method runs k-means clustering on the dataset for a range of values for k (say from 1-10) and then for each value of k computes an average score for all clusters. By default, the distortion score is computed, the sum of square distances from each point to its assigned center.
To find the optimal value of clusters, the elbow method follows the below steps: 1 Execute the K-means clustering on a given dataset for different K values (ranging from 1-10). 2 For each value of K, calculates the WCSS value. 3 Plots a graph/curve between WCSS values and the respective number of clusters K.
The number of clusters is were the elbow bends. The x axis of the plot is the number of clusters and the y axis is the Within Clusters Sum of Squares (WCSS) for each number of clusters: wcss = [] for i in range(1, 11): clustering = KMeans(n_clusters=i, init='k-means++', random_state=42) clustering. fit(df) wcss.
In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.
you can use the inertia attribute of Kmeans class.
Assuming X is your dataset:
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
X = # <your_data>
distorsions = []
for k in range(2, 20):
kmeans = KMeans(n_clusters=k)
kmeans.fit(X)
distorsions.append(kmeans.inertia_)
fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, 20), distorsions)
plt.grid(True)
plt.title('Elbow curve')
You had some syntax problems in the code. They should be fixed now:
Ks = range(1, 10)
km = [KMeans(n_clusters=i) for i in Ks]
score = [km[i].fit(my_matrix).score(my_matrix) for i in range(len(km))]
The fit
method just returns a self
object. In this line in the original code
cluster_array = [km[i].fit(my_matrix)]
the cluster_array
would end up having the same contents as km
.
You can use the score
method to get the estimate for how well the clustering fits. To see the score for each cluster simply run plot(Ks, score)
.
You can also use euclidean distance between the each data with the cluster center distance to evaluate how many clusters to choose. Here is the code example.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
iris = load_iris()
x = iris.data
res = list()
n_cluster = range(2,20)
for n in n_cluster:
kmeans = KMeans(n_clusters=n)
kmeans.fit(x)
res.append(np.average(np.min(cdist(x, kmeans.cluster_centers_, 'euclidean'), axis=1)))
plt.plot(n_cluster, res)
plt.title('elbow curve')
plt.show()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With