 

Implementing the Elbow Method for finding the optimum number of clusters for K-Means Clustering in R [closed]

Tags: r, k-means

I want to use K-Means Clustering for my dataset. I am using the kmeans() function in R for doing this.

library(fpc)  # plotcluster() comes from the fpc package
k <- kmeans(data, centers = 3)
plotcluster(data, k$cluster)

However, I am not sure what the correct value of k is for this function. I want to try the elbow method for this. Are there any packages in R that perform clustering with the elbow method to find the optimum number of clusters?

asked Aug 04 '13 by user1946152

2 Answers

There are two questions mixed up here. One is how to find a change point on a curve, and the other is how to quantify the quality of fit when using k-means to classify data. The cluster-analysis folks tend to lump these two questions together, but don't be afraid of looking into other curve-fitting / change-point methods, using whichever fit metric seems most appropriate to your case.

I know the 'elbow' method you linked to is a specific method, but you might be interested in something similar that looks for the 'knee' in the Bayesian Information Criterion (BIC). The kink in BIC versus the number of clusters (k) is the point at which you can argue that the improvement in BIC from adding more clusters no longer justifies the extra computational cost of the more complex solution. There is a nice method that detects the optimum number of clusters from the change in sign of the second derivative of the BIC. See e.g.

Zhao, Q., V. Hautamaki, and P. Franti, 2008a: Knee point detection in BIC for detecting the number of clusters. Advanced Concepts for Intelligent Vision Systems, J. Blanc-Talon, S. Bourennane, W. Philips, D. Popescu, and P. Scheunders, Eds., Springer Berlin/Heidelberg, Lecture Notes in Computer Science, Vol. 5259, 664–673, doi:10.1007/978-3-540-88458-3_60.

Zhao, Q., M. Xu, and P. Franti, 2008b: Knee point detection on Bayesian information criterion. Tools with Artificial Intelligence, 2008. ICTAI '08. 20th IEEE International Conference on, Vol. 2, 431–438, doi:10.1109/ICTAI.2008.154.
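
As a minimal illustration of the knee-detection idea, here is a sketch in R. This is my own sketch, not the papers' exact method: the BIC-style score n*log(WSS/n) + k*log(n) is a common rough approximation for k-means, and the toy data is a stand-in for your own.

# Compute an approximate BIC-style score for k-means over a range of k,
# then look for the knee via a sign change in the second difference.
set.seed(1)
data <- matrix(rnorm(200), ncol = 2)   # toy data; substitute your own

n <- nrow(data)
ks <- 1:10
bic <- sapply(ks, function(k) {
  fit <- kmeans(data, centers = k, nstart = 25)
  n * log(fit$tot.withinss / n) + k * log(n)  # crude BIC-style penalty (assumption)
})

d2 <- diff(bic, differences = 2)               # discrete second derivative of BIC(k)
knee <- ks[which(diff(sign(d2)) != 0)[1] + 1]  # first sign change marks the knee (NA if none)
knee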

You might be interested in an automated application of this to weather data, reported in http://journals.ametsoc.org/doi/abs/10.1175/JAMC-D-11-0227.1

See also Finding the best trade-off point on a curve for an excellent discussion of the general approach.
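
One such heuristic from that discussion, sketched here in R (the helper name elbow_point is mine), picks the point on the curve farthest from the straight line joining its endpoints:

# Perpendicular distance of each curve point from the chord joining the
# first and last points; the farthest point is taken as the elbow.
elbow_point <- function(y, x = seq_along(y)) {
  chord <- c(x[length(x)] - x[1], y[length(y)] - y[1])
  chord <- chord / sqrt(sum(chord^2))      # unit vector along the chord
  dx <- x - x[1]
  dy <- y - y[1]
  proj <- dx * chord[1] + dy * chord[2]    # projection onto the chord
  dist <- sqrt((dx - proj * chord[1])^2 + (dy - proj * chord[2])^2)
  x[which.max(dist)]
}

For example, elbow_point(wss) applied to the within-groups sums of squares from the scree-plot loop in the answer below returns a candidate k.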

One final observation: make sure that you are consistent in your logarithms. Different communities write BIC with different log bases, and this can be a source of error when comparing absolute values across results.
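
For instance, log10 and the natural log differ only by a constant factor, so knee locations are unchanged, but absolute BIC values are not comparable across conventions:

x <- c(10, 100, 1000)
all.equal(log10(x), log(x) / log(10))  # TRUE: the bases differ by the factor log(10)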

answered Sep 21 '22 by Andy Clifton


Yes, you can find the best number of clusters using the elbow method, but I found it troublesome to extract that value from the elbow graph in a script. You can eyeball the graph and pick the elbow point yourself, but automating that was a lot of work.

Another option is the silhouette method. In my experience its result agrees with the result of the elbow method.

Here's what I did.

# Dataset for clustering
n <- 150
g <- 6
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
                y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
mydata <- d

# Arrange the plots in a 3x2 grid
par(mfrow = c(3, 2))

# Plot the original dataset
plot(mydata$x, mydata$y, main = "Original Dataset")

# Scree plot to determine the number of clusters
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))  # total SS for k = 1
for (i in 2:15) {
  wss[i] <- sum(kmeans(mydata, centers = i, nstart = 25)$withinss)
}
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

# Ward hierarchical clustering
d <- dist(mydata, method = "euclidean")  # distance matrix
fit <- hclust(d, method = "ward.D2")     # "ward" was renamed; "ward.D2" is Ward's criterion
plot(fit)                                # display dendrogram
groups <- cutree(fit, k = 5)             # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k = 5, border = "red")

# Silhouette analysis for determining the number of clusters
library(cluster)  # pam() comes from the cluster package
asw <- numeric(20)
for (k in 2:20)
  asw[k] <- pam(mydata, k)$silinfo$avg.width  # average silhouette width for each k
k.best <- which.max(asw)

cat("silhouette-optimal number of clusters:", k.best, "\n")
plot(pam(mydata, k.best))

# K-means cluster analysis with the silhouette-optimal k
fit <- kmeans(mydata, k.best, nstart = 25)
# get cluster means
aggregate(mydata, by = list(fit$cluster), FUN = mean)
# append cluster assignment
mydata <- data.frame(mydata, clusterid = fit$cluster)
plot(mydata$x, mydata$y, col = fit$cluster, main = "K-means clustering results")

Hope it helps!!

answered Sep 20 '22 by Udeep Shakya