How do you manually compute the silhouette, cohesion and separation of a cluster?

Good day!

I have been looking all over the Internet for how to compute the silhouette coefficient, cohesion and separation. Unfortunately, despite the resources available, I just can't understand the formulas posted. I know that some tools implement them, but I want to know how to compute them manually, especially given a vector space model.

Assuming that I have the following clusters:

Cluster 1 ={{1,0},{1,1}}
Cluster 2 ={{1,2},{2,3},{2,2},{1,2}}
Cluster 3 ={{3,1},{3,3},{2,1}}

The way I understood it according to [1] is that I have to get the average of the points per cluster:

C1 X = 1; Y = .5
C2 X = 1.5; Y = 2.25
C3 X = 2.67; Y = 1.67
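
The averaging step above can be checked with a minimal Python sketch. The names `clusters`, `centroid` and `centroids` are illustrative only and do not come from [1].

```python
# Clusters from the question, as lists of (x, y) points.
clusters = {
    "C1": [(1, 0), (1, 1)],
    "C2": [(1, 2), (2, 3), (2, 2), (1, 2)],
    "C3": [(3, 1), (3, 3), (2, 1)],
}

def centroid(points):
    """Mean of each coordinate over all points in one cluster."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

centroids = {name: centroid(pts) for name, pts in clusters.items()}
for name, c in centroids.items():
    print(name, [round(v, 2) for v in c])
# C1 [1.0, 0.5]
# C2 [1.5, 2.25]
# C3 [2.67, 1.67]
```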

Given the means, I compute the cohesion of each cluster as its Sum of Squared Errors (SSE):

Cohesion(C1) = (1-1)^2 + (1-1)^2 + (0-.5)^2 + (1-.5)^2 = 0.5
Cohesion(C2) = (1-1.5)^2 + (2-1.5)^2 + (2-1.5)^2 + (1-1.5)^2 + (2-2.25)^2 + (3-2.25)^2 + (2-2.25)^2 + (2-2.25)^2 = 1.75
Cohesion(C3) = (3-2.67)^2 + (3-2.67)^2 + (2-2.67)^2 + (1-1.67)^2 + (3-1.67)^2 + (1-1.67)^2 = 3.3334

Cluster(C) = 0.5 + 1.75 + 3.3334 = 5.5834
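
As a cross-check, the SSE can be recomputed from unrounded centroids with a short sketch. The function name `sse` is an assumption made for illustration. Every Y term of cluster 2 is measured against its mean of 2.25, as listed in the averages above, and the exact centroid of cluster 3 gives 3.3333 rather than the 3.3334 obtained from rounded means.

```python
clusters = {
    "C1": [(1, 0), (1, 1)],
    "C2": [(1, 2), (2, 3), (2, 2), (1, 2)],
    "C3": [(3, 1), (3, 3), (2, 1)],
}

def sse(points):
    """Sum of squared coordinate differences between each point and the cluster mean."""
    n = len(points)
    center = [sum(coord) / n for coord in zip(*points)]
    return sum((x - m) ** 2 for p in points for x, m in zip(p, center))

cohesion = {name: sse(pts) for name, pts in clusters.items()}
total = sum(cohesion.values())
for name, value in cohesion.items():
    print(name, round(value, 4))
print("total", round(total, 4))
# C1 0.5
# C2 1.75
# C3 3.3333
# total 5.5833
```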

My questions are:
1. Did I compute cohesion correctly?
2. How do I compute separation?
3. How do I compute the silhouette coefficient?

Thank you.


References:
[1] http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf

asker asked Apr 30 '14 11:04



1 Answer

Cluster 1 ={{1,0},{1,1}}
Cluster 2 ={{1,2},{2,3},{2,2},{1,2}}
Cluster 3 ={{3,1},{3,3},{2,1}}

Take a point {1,0} in cluster 1

Calculate its average distance to all other points in its own cluster, i.e. cluster 1. Cluster 1 has only one other point, {1,1}, so the average is just that one distance.

So a1 =√( (1-1)^2 + (0-1)^2) =√(0+1)=√1=1
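
A minimal sketch of this step, assuming Python 3.8+ for `math.dist`. The helper name `mean_intra_distance` is hypothetical. The point is excluded by index rather than by value, so that a duplicated point such as {1,2} in cluster 2 would still count as a neighbour of its twin. The sketch assumes the cluster has at least two points.

```python
from math import dist  # Euclidean distance between two points

cluster_1 = [(1, 0), (1, 1)]

def mean_intra_distance(index, cluster):
    """a(i): average distance from cluster[index] to the other members of its cluster."""
    point = cluster[index]
    others = [q for j, q in enumerate(cluster) if j != index]
    return sum(dist(point, q) for q in others) / len(others)

a1 = mean_intra_distance(0, cluster_1)
print(a1)  # 1.0
```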

Now for the object {1,0} in cluster 1 calculate its average distance from all the objects in cluster 2 and cluster 3. Of these take the minimum average distance.

So for cluster 2

{1,0} ----> {1,2} = distance = √((1-1)^2 + (0-2)^2) =√(0+4)=√4=2
{1,0} ----> {2,3} = distance = √((1-2)^2 + (0-3)^2) =√(1+9)=√10=3.16
{1,0} ----> {2,2} = distance = √((1-2)^2 + (0-2)^2) =√(1+4)=√5=2.24
{1,0} ----> {1,2} = distance = √((1-1)^2 + (0-2)^2) =√(0+4)=√4=2

Therefore, the average distance of point {1,0} in cluster 1 to all the points in cluster 2 =

(2+3.16+2.24+2)/4 = 2.35

Similarly, for cluster 3

{1,0} ----> {3,1} = distance = √((1-3)^2 + (0-1)^2) =√(4+1)=√5=2.24
{1,0} ----> {3,3} = distance = √((1-3)^2 + (0-3)^2) =√(4+9)=√13=3.61
{1,0} ----> {2,1} = distance = √((1-2)^2 + (0-1)^2) =√(1+1)=√2=1.41

Therefore, the average distance of point {1,0} in cluster 1 to all the points in cluster 3 =

(2.24+3.61+1.41)/3 = 2.42

Now, the minimum average distance of the point {1,0} in cluster 1 to the other clusters 2 and 3 is,

b1 = 2.35 (2.35 < 2.42)

So the silhouette coefficient of the point {1,0} in cluster 1 is

s1 = (b1-a1)/max(a1,b1) = 1-(a1/b1) = 1-(1/2.35) ≈ 1-0.426 = 0.574
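
The nearest-cluster step and the final coefficient for the point {1,0} can be sketched as follows. The variable names are illustrative, and the code uses the general form (b - a) / max(a, b), which reduces to 1 - a/b whenever a < b.

```python
from math import dist  # Euclidean distance, Python 3.8+

point = (1, 0)
a = 1.0  # distance to {1,1}, the only other member of cluster 1

other_clusters = {
    "C2": [(1, 2), (2, 3), (2, 2), (1, 2)],
    "C3": [(3, 1), (3, 3), (2, 1)],
}

# Average distance from the point to every member of each other cluster.
mean_dist = {
    name: sum(dist(point, q) for q in pts) / len(pts)
    for name, pts in other_clusters.items()
}

nearest = min(mean_dist, key=mean_dist.get)  # cluster with the smallest average
b = mean_dist[nearest]
s = (b - a) / max(a, b)

print({name: round(d, 2) for name, d in mean_dist.items()})
print(nearest, round(b, 2), round(s, 3))
# {'C2': 2.35, 'C3': 2.42}
# C2 2.35 0.574
```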

In a similar fashion, calculate the silhouette coefficient for every other point in cluster 1 and for every point in cluster 2 and cluster 3 by repeating the steps above. The silhouette coefficient of a cluster is the average over its points, and the average over all points scores the clustering as a whole. The cluster with the greatest average silhouette coefficient is the best as per this evaluation.
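
To repeat the steps for the whole data set, a sketch along these lines may help. It follows the usual convention of computing the coefficient for every point and then averaging per cluster and overall. The function name `silhouette` and the choice to score a single-point cluster as 0 are assumptions of this sketch, not something stated in the answer.

```python
from math import dist          # Euclidean distance, Python 3.8+
from statistics import mean

clusters = {
    "C1": [(1, 0), (1, 1)],
    "C2": [(1, 2), (2, 3), (2, 2), (1, 2)],
    "C3": [(3, 1), (3, 3), (2, 1)],
}

def silhouette(label, index):
    """Silhouette coefficient of the point clusters[label][index]."""
    point = clusters[label][index]
    # Exclude the point by index so duplicate coordinates are still counted.
    own = [q for j, q in enumerate(clusters[label]) if j != index]
    if not own:
        return 0.0  # assumed convention for a cluster with a single point
    a = mean(dist(point, q) for q in own)
    b = min(
        mean(dist(point, q) for q in pts)
        for other, pts in clusters.items()
        if other != label
    )
    return (b - a) / max(a, b)

scores = {
    label: [silhouette(label, i) for i in range(len(pts))]
    for label, pts in clusters.items()
}
overall = mean(s for vals in scores.values() for s in vals)

for label, vals in scores.items():
    print(label, [round(s, 3) for s in vals], "mean:", round(mean(vals), 3))
print("overall:", round(overall, 3))
```

Values close to 1 indicate a point that sits well inside its own cluster, values near 0 indicate a point on the border between two clusters, and negative values suggest the point may be closer to another cluster.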

Note: The distance here is the Euclidean Distance! You can also have a look at this video for further explanation:

https://www.coursera.org/learn/cluster-analysis/lecture/RJJfM/6-2-clustering-evaluation-measuring-clustering-quality
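
The steps above cover only the silhouette. For the separation part of the question, one common centroid-based definition (an assumption here, not something taken from this answer) is the between-cluster sum of squares: each cluster's size times the squared distance from its centroid to the overall mean. Under that definition, cohesion (WSS) plus separation (BSS) equals the total sum of squares (TSS), which the sketch below checks. All names are illustrative.

```python
clusters = {
    "C1": [(1, 0), (1, 1)],
    "C2": [(1, 2), (2, 3), (2, 2), (1, 2)],
    "C3": [(3, 1), (3, 3), (2, 1)],
}

def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(coord) / n for coord in zip(*points)]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

all_points = [p for pts in clusters.values() for p in pts]
overall_mean = mean_point(all_points)

# Separation: cluster size times squared distance of its centroid to the overall mean.
bss = sum(len(pts) * sq_dist(mean_point(pts), overall_mean) for pts in clusters.values())
# Cohesion: squared distance of every point to its own cluster centroid.
wss = sum(sq_dist(p, mean_point(pts)) for pts in clusters.values() for p in pts)
# Total: squared distance of every point to the overall mean.
tss = sum(sq_dist(p, overall_mean) for p in all_points)

print(round(bss, 4), round(wss, 4), round(tss, 4))
# 7.9722 5.5833 13.5556
```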

raikumardipak answered Oct 05 '22 03:10