
Determining accuracy for k-means clustering

I want to classify the Iris flower dataset (I removed the labels, so it is now unlabeled data) using sklearn's k-means clustering function. I have built the model and the output seems to group the data correctly for the most part. However, it assigns the cluster labels (0, 1 and 2) arbitrarily, so I cannot compare them to my own labels to determine the accuracy (I have marked setosa as 0, versicolor as 1, virginica as 2). Is there any way to correctly label the flowers?

Here's the code:

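from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

cluster = KMeans(n_clusters=3)
cluster.fit(features)
pred = cluster.labels_  # arbitrary cluster ids (0, 1, 2), not class labels
score = round(accuracy_score(name_val, pred), 4)  # argument order is (y_true, y_pred)
print('Accuracy scored using k-means clustering: ', score)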

features, as expected, contains the features; name_val is a matrix containing the flower labels: 0 for setosa, 1 for versicolor, 2 for virginica.

Edit: one solution I came up with was setting random_state to a fixed number so that the labeling stays constant. Is there any other solution?

Ach113 asked Jan 02 '23 03:01


2 Answers

You need to look at clustering metrics to evaluate your predictions. These include:

  1. Homogeneity score
  2. V-measure
  3. Completeness score, and so on

Now take Completeness Score for example,

A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.

For example

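from sklearn.metrics.cluster import completeness_score
# Arguments are (labels_true, labels_pred); the label ids are swapped but the grouping is identical
print(completeness_score([0, 0, 1, 1], [1, 1, 0, 0]))
# Output: 1.0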

This is similar to what you want. For you, the code would be completeness_score(name_val, pred), since the function expects the true labels first and the predicted labels second. Note that the particular label assigned to a data point is not important; what matters is how the points are labelled relative to each other.

Homogeneity, on the other hand, is satisfied when each cluster contains only data points that are members of a single class. V-measure is the harmonic mean of the two, defined as 2 * (homogeneity * completeness) / (homogeneity + completeness).
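As a rough illustration of these three metrics, here is a small sketch on hand-made toy labels (the label lists are hypothetical, not the Iris data). It assumes scikit-learn is installed and uses homogeneity_score, completeness_score and v_measure_score from sklearn.metrics. It checks that a pure relabelling still scores perfectly, and that V-measure matches the formula above for an imperfect clustering.

```python
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

# Hypothetical toy labels, chosen only for illustration.
labels_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
relabelled = [2, 2, 2, 0, 0, 0, 1, 1, 1]  # same grouping, different cluster ids
imperfect = [2, 2, 2, 0, 0, 1, 1, 1, 1]   # one class-1 point lands in the class-2 cluster

# A pure relabelling keeps every class in exactly one cluster,
# so all three metrics treat it as a perfect result.
perfect_scores = (
    homogeneity_score(labels_true, relabelled),
    completeness_score(labels_true, relabelled),
    v_measure_score(labels_true, relabelled),
)
print("relabelled:", perfect_scores)

# For the imperfect clustering, compare v_measure_score with the harmonic mean.
h = homogeneity_score(labels_true, imperfect)
c = completeness_score(labels_true, imperfect)
v = v_measure_score(labels_true, imperfect)
v_by_hand = 2 * (h * c) / (h + c)
print("imperfect:", h, c, v, v_by_hand)
```

All three functions take the true labels first and the predicted cluster ids second.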

Read the official documentation here: Homogeneity, completeness and V-measure

Gambit1614 answered Jan 05 '23 18:01


First of all, you are not classifying, you are clustering the data. Classification is a different process.

The K-Means algorithm includes randomness in choosing the initial cluster centers. By setting random_state you can reproduce the same clustering, as the initial cluster centers will be the same. However, this does not fix your problem. What you want is for the cluster with id 0 to be setosa, 1 to be versicolor, and so on. This is not possible because the K-Means algorithm has no knowledge of these categories; it only groups flowers by their similarity.

What you can do is create a rule to determine which cluster corresponds to which category. For example, you can say that if more than 50% of the flowers in a cluster are also in the setosa category, then that cluster's flowers should be compared to the set of flowers in the setosa category.
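One possible way to sketch such a rule is shown below, using only the standard library. The helper names map_clusters_to_classes and mapped_accuracy are hypothetical, and the toy label lists are made up for illustration. Note one assumption: this sketch assigns each cluster to its most common class (a plurality vote) rather than requiring strictly more than 50%, and it does not prevent two clusters from mapping to the same class.

```python
from collections import Counter


def map_clusters_to_classes(true_labels, cluster_ids):
    """Map each cluster id to the most common true label among its members."""
    members = {}
    for label, cluster in zip(true_labels, cluster_ids):
        members.setdefault(cluster, Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in members.items()}


def mapped_accuracy(true_labels, cluster_ids):
    """Fraction of points whose cluster's majority class equals their true label."""
    mapping = map_clusters_to_classes(true_labels, cluster_ids)
    hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
    return hits / len(true_labels)


# Hypothetical toy data: cluster ids are arbitrary and one point is misgrouped.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster_ids = [2, 2, 2, 0, 0, 1, 1, 1, 1]

mapping = map_clusters_to_classes(true_labels, cluster_ids)
accuracy = mapped_accuracy(true_labels, cluster_ids)
print(mapping)   # cluster 2 -> class 0, cluster 0 -> class 1, cluster 1 -> class 2
print(accuracy)  # 8 of the 9 points agree with their cluster's majority class
```

With the variables from the question, the equivalent call would be something like mapped_accuracy(name_val, pred), assuming both are flat sequences of integer labels.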

That's the best way of doing it that I can think of. However, this is not how clustering quality is usually evaluated; there are metrics you can use, such as the Silhouette Coefficient. I hope I helped.
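For reference, a minimal sketch of computing the Silhouette Coefficient, which needs no true labels at all. It assumes scikit-learn is installed and loads Iris through load_iris as a stand-in for the asker's features; the n_init and random_state values are arbitrary choices for reproducibility, not part of the original code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Assumption: load_iris stands in for the asker's `features` array.
features = load_iris().data

# Fixed random_state only makes the run reproducible; it does not
# make the cluster ids line up with the species labels.
cluster = KMeans(n_clusters=3, n_init=10, random_state=0)
pred = cluster.fit_predict(features)

# Mean silhouette over all samples, in the range [-1, 1]; higher means
# tighter, better separated clusters.
score = silhouette_score(features, pred)
print('Silhouette coefficient for k-means clustering:', round(score, 4))
```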

Theo Vasileiadis answered Jan 05 '23 18:01