I'd like to use the silhouette score in my script to automatically determine the number of clusters for k-means clustering with sklearn.
import numpy as np
import pandas as pd
import csv
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
filename = "CSV_BIG.csv"
# Read the CSV file with the Pandas lib.
path_dir = ".\\"
dataframe = pd.read_csv(path_dir + filename, encoding = "utf-8", sep = ';' ) # "ISO-8859-1")
df = dataframe.copy(deep=True)
#Use silhouette score
range_n_clusters = list (range(2,10))
print ("Number of clusters from 2 to 9: \n", range_n_clusters)
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters).fit(?)
    preds = clusterer.predict(?)
    centers = clusterer.cluster_centers_
    score = silhouette_score(?, preds, metric='euclidean')
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
Can someone help me with the question marks? I don't understand what to put in their place. I took the code from an example. The commented part is the previous version, where I do k-means clustering with a fixed number of clusters set to 4. The code is correct that way, but in my project I need to choose the number of clusters automatically.
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the mean distance between a sample and the points of the nearest cluster that the sample is not a part of.
The value of the silhouette coefficient lies in the range [-1, 1]. A score of 1 is the best value: it means that data point i is very compact within the cluster to which it belongs and far away from the other clusters. The worst value is -1. Values near 0 denote overlapping clusters.
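As a small check of the formula above, the sketch below computes (b - a) / max(a, b) by hand and compares it with sklearn's silhouette_samples. The toy data, the labels, and the helper name silhouette_for_sample are made up for illustration. The helper also assumes every cluster has at least two points (sklearn defines the score of a singleton cluster as 0, which this sketch does not handle).

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1-D data with two well-separated groups (illustrative values only).
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])


def silhouette_for_sample(X, labels, i):
    """Hypothetical helper: (b - a) / max(a, b) for sample i, Euclidean distance."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # the sample itself is excluded from its own cluster
    a = dists[same].mean()  # mean intra-cluster distance
    # mean distance to each other cluster; b is the smallest of these
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)


manual = np.array([silhouette_for_sample(X, labels, i) for i in range(len(X))])
library = silhouette_samples(X, labels)

print("by hand:", manual)
print("sklearn:", library)
```

For the first point, a = 1 (distance to its only cluster mate) and b = (10 + 11) / 2 = 10.5, so the coefficient is 9.5 / 10.5. silhouette_score is simply the mean of these per-sample values.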
Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually.
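The numbers behind such a silhouette plot can be inspected without drawing anything. The following is only a sketch on synthetic data from make_blobs (the sample count, the number of centers, and the random seeds are arbitrary choices, not taken from the question). It groups the per-sample silhouette values by cluster, which is what each band of the plot shows.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in data; replace with your own numeric feature matrix.
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
sample_values = silhouette_samples(X, labels)

# One sorted array of silhouette values per cluster, as in one band of the plot.
per_cluster = {}
for cluster_id in np.unique(labels):
    values = np.sort(sample_values[labels == cluster_id])
    per_cluster[int(cluster_id)] = values
    print("cluster {}: size={}, mean={:.3f}, min={:.3f}".format(
        cluster_id, values.size, values.mean(), values.min()))
```

A cluster whose values are mostly low or negative is a hint that the chosen number of clusters does not fit the data well.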
The KMeans class from the sklearn.cluster module of the Scikit-learn library is used for k-means clustering. You can see that the class is imported in the following script. The make_blobs() method from the sklearn.
I am assuming you are going to use the silhouette score to get the optimal number of clusters.
First declare a separate KMeans object, and then call its fit_predict function on your data df, like this:
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(df)
    centers = clusterer.cluster_centers_
    score = silhouette_score(df, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
See this official example for more clarity.
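To actually choose the number of clusters automatically, one option is to keep every score and take the n_clusters with the highest one. This is a minimal sketch, not the only way to do it. It uses synthetic data from make_blobs in place of df because the contents of CSV_BIG.csv are not known here, and the blob centers, n_init, and random_state values are arbitrary assumptions. With your own data, pass a numeric DataFrame or array instead of X.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for df: four well-separated blobs (assumed for illustration).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [-10, 10], [10, -10], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for n_clusters in range(2, 10):
    clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    preds = clusterer.fit_predict(X)
    scores[n_clusters] = silhouette_score(X, preds, metric="euclidean")
    print("For n_clusters = {}, silhouette score is {}".format(
        n_clusters, scores[n_clusters]))

# The candidate with the highest mean silhouette score wins.
best_k = max(scores, key=scores.get)
print("Best number of clusters:", best_k)

# Refit with the chosen value to get the final labels and centers.
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
```

Note that silhouette_score needs at least 2 clusters and fewer clusters than samples, which is why the range starts at 2.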