Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Evaluating K-means accuracy

I created a 3-dimensional random data sets with 4 defined patterns/classes in MATLAB. I applied the K-means algorithm on the data to see how well K-means can classify my samples based on created 4 patterns/classes.

I need help with the following;

  1. What function/code can I use to evaluate how well the K-means algorithm has identified the classes of my samples correctly? Assuming I set K=4 as illustrated in the image below:

enter image description here

  1. How can I automatically identify the number of classes (K)? Assuming the classes in my data is unknown?

My aim is to evaluate K-mean's accuracy and how changes to the data (by pre-processing) affects the algorithm’s ability to identify classes. Examples with MATLAB code would be helpful!

like image 497
Young_DataAnalyst Avatar asked Dec 09 '22 04:12

Young_DataAnalyst


1 Answers

One basic metric to measure how "good" your clustering in comparison to your known class labels is called purity. Now this is an example of supervised learning where you have some idea of an external metric that is a labeling of instances based on real world data.

The mathematical definition of purity is as follows:

enter image description here

In words what this means is, quoting from a professor at Stanford university here,

To compute purity , each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N.

A simple example would be if you had a very naive clustering that was produced via Kmeans with k=2 that looked like:

Cluster1    Label
  1           A         
  5           B
  7           B
  3           B
  2           B

Cluster2    Label
  4           A
  6           A
  8           A
  9           B

In Cluster1 there are 4 instances of label B and 1 instance of label A and Cluster2 has 3 instances with label A and 1 instance of cluster B. Now you are looking for the total purity so that would be the sum of the purities of each cluster, in this case k=2. So the purity of Cluster1 is the maximum number of instances in respect to the given labels divided by the total number of instances in Cluster1.

Therefore the purity of Cluster1 is:

4/5 = 0.80

The four comes from the fact that the label that occurs the most (B) occurs 4 times and there are 5 total instances in the cluster.

So this follows that the purity of Cluster2 is:

3/4 = 0.75

Now the total purity is just the sum of the purities which is 1.55. So what does this tell us? A cluster is considered to be "pure" if it has a purity of 1 since that indicates that all of the instances in that cluster are of the same label. This means your original label classification was pretty good and that your Kmeans did a pretty good job. The "best" purity score for an entire data set would be equal to your original K-number of clusters since that would imply that every cluster has an individual purity score of 1.

However, you need to be aware that purity is not always the best or most telling metric. For example, if you had 10 points and you chose k=10 then every cluster would have a purity of 1 and therefore an overall purity of 10 which equal k. In that instance it would be better to use different external metrics such as precision, recall, and F-measure. I would suggest looking into those if you can. And again to reiterate, this is only useful with supervised learning where you have pre-knowledge of a labeling system which I believe is the case from your question.

To answer your second question... choosing your K number of clusters is the most difficult part for Kmeans without any prior knowledge of the data. There are techniques as to mitigate the problems presented by choosing the initial K-number of clusters and centroids. Probably the most common is an algorithm called Kmeans++. I would suggest looking into that for further info.

like image 84
alacy Avatar answered Dec 10 '22 17:12

alacy