Hierarchical Clustering: Determine optimal number of cluster and statistically describe Clusters

Tags:

I could use some advice on methods in R to determine the optimal number of clusters and later on describe the clusters with different statistical criteria. I’m new to R with basic knowledge about the statistical foundations of cluster analysis.

Methods to determine the number of clusters: In the literature one common method to do so is the so called "Elbow-criterion" which compares the Sum of Squared Differences (SSD) for different cluster solutions. Therefore the SSD is plotted against the numbers of Cluster in the analysis and an optimal number of clusters is determined by identifying the “elbow” in the plot (e.g. here: https://en.wikipedia.org/wiki/File:DataClustering_ElbowCriterion.JPG) This method is a first approach to get a subjective impression. Therefore I’d like to implement it in R. The information on the internet on this is sparse. There is one good example here: http://www.mattpeeples.net/kmeans.html where the author also did an interesting iterative approach to see if the elbow is somehow stable after several repetitions of the clustering process (nevertheless it is for partitioning cluster methods not for hierarchical). Other methods in Literature comprise the so called “stopping rules”. MILLIGAN & COOPER compared 30 of these stopping rules in their paper “An examination of procedures for determining the number of clusters in a data set” (available here: http://link.springer.com/article/10.1007%2FBF02294245) finding that the Stopping Rule from Calinski and Harabasz provided the best results in a Monte Carlo evaluation. Information on implementing this in R is even sparser. So if anyone has ever implemented this or another Stopping rule (or other method) some advice would be very helpful.
Statistically describe the clusters:For describing the clusters I thought of using the mean and some sort of Variance Criterion. My data is on agricultural land-use and shows the production numbers of different crops per Municipality. My aim is to find similar patterns of land-use in my dataset.

I produced a script for a subset of objects to do a first test-run. It looks like this (explanations on the steps within the script, sources below).

    #Clusteranalysis agriculture

    #Load data
    agriculture <-read.table ("C:\\Users\\etc...", header=T,sep=";")
    attach(agriculture)

    #Define Dataframe to work with
    df<-data.frame(agriculture)

    #Define a Subset of objects to first test the script
    a<-df[1,]
    b<-df[2,]
    c<-df[3,]
    d<-df[4,]
    e<-df[5,]
    f<-df[6,]
    g<-df[7,]
    h<-df[8,]
    i<-df[9,]
    j<-df[10,]
    k<-df[11,]
    #Bind the objects
    aTOk<-rbind(a,b,c,d,e,f,g,h,i,j,k)

    #Calculate euclidian distances including only the columns 4 to 24
    dist.euklid<-dist(aTOk[,4:24],method="euclidean",diag=TRUE,upper=FALSE, p=2)
    print(dist.euklid)

    #Cluster with Ward
    cluster.ward<-hclust(dist.euklid,method="ward")

    #Plot the dendogramm. define Labels with labels=df$Geocode didn't work
    plot(cluster.ward, hang = -0.01, cex = 0.7)

    #here are missing methods to determine the optimal number of clusters

    #Calculate different solutions with different number of clusters
    n.cluster<-sapply(2:5, function(n.cluster)table(cutree(cluster.ward,n.cluster)))
    n.cluster

    #Show the objects within clusters for the three cluster solution
    three.cluster<-cutree(cluster.ward,3)
    sapply(unique(three.cluster), function(g)aTOk$Geocode[three.cluster==g])

    #Calculate some statistics to describe the clusters
    three.cluster.median<-aggregate(aTOk[,4:24],list(three.cluster),median)
    three.cluster.median
    three.cluster.min<-aggregate(aTOk[,4:24],list(three.cluster),min)
    three.cluster.min
    three.cluster.max<-aggregate(aTOk[,4:24],list(three.cluster),max)
    three.cluster.max
    #Summary statistics for one variable
    three.cluster.summary<-aggregate(aTOk[,4],list(three.cluster),summary)
    three.cluster.summary

    detach(agriculture)

Sources:

http://www.r-tutor.com/gpu-computing/clustering/distance-matrix
How to apply a hierarchical or k-means cluster analysis using R?
http://statistics.berkeley.edu/classes/s133/Cluster2a.html

381

asked Nov 06 '12 10:11

Joschi

1 Answers

This is a very late answer and probably not useful for the asker anymore - but maybe for others. Check out the package NbClust. It contains 26 indices that give you a recommended number of clusters (and you can also choose your type of clustering). You can run it in such a way that you get the results for all the indices and then you can basically go with the number of clusters recommended by most indices. And yes, I think the basic statistics are the best way to describe clusters.

139

answered Sep 21 '22 07:09

Geraldine

Related questions
                            
                                Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is no package called 'stringi' [duplicate]
                            
                                Make package in R not required to load when I startup R/RStudio?
                            
                                R: as.POSIXct timezone and scale_x_datetime issues in my dataset
                            
                                Error when installing an R package from github: Could not find build tools necessary to build data.table
                            
                                Shiny R renderText paste new line and bold
                            
                                LaTeX math expression in knitr kable (Sweave)
                            
                                For R Markdown, How do I display a matrix from R variable
                            
                                tags$style specific modaldialog element in shiny
                            
                                Join datatables using column names stored in variables
                            
                                How to display emojis in ggplot2 using emo package in R?
                            
                                How can I declare a thousand separator in read.csv? [duplicate]
                            
                                ggplot: recommended colour palettes also distinguishable for B&W printing?
                            
                                Problem with error in ggplot: “Error in grid.Call(”L_textBounds“, as.graphicsAnnot(x$label), x$x, x$y, … ” [duplicate]
                            
                                Calculation of R^2 value for a non-linear regression
                            
                                Subset a data frame based on column entry (or rank)
                            
                                R Strptime Year and Month with No Delimiter returning NA
                            
                                How can I run an 'R' script without suppressing output?
                            
                                ggplot2 add a legend for several stat_functions
                            
                                Calling .NET/C# from R
                            
                                How to put greek letters in latex knitr figure caption?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Hierarchical Clustering: Determine optimal number of cluster and statistically describe Clusters

Tags:

r

cluster-analysis

data-mining

Joschi

People also ask

1 Answers

Geraldine

Recent Activity

Donate For Us