 

Test significance of clusters on a PCA plot

Tags: r, statistics, pca

Is it possible to test the significance of the clustering between two known groups on a PCA plot? For example, to test how close the clusters are, the amount of spread (variance) within each, and the amount of overlap between them.

asked Nov 28 '13 by rmf




2 Answers

Here is a qualitative method that uses ggplot(...) to draw 95% confidence ellipses around the clusters. Note that stat_ellipse(...) uses the bivariate t-distribution by default.

library(ggplot2)

df     <- data.frame(iris)                        # iris dataset
pca    <- prcomp(df[, 1:4], retx = TRUE, scale. = TRUE)  # scaled PCA (exclude species col)
scores <- pca$x[, 1:3]                            # scores for first three PCs

# k-means clustering (assume 3 clusters)
km     <- kmeans(scores, centers = 3, nstart = 5)
ggdata <- data.frame(scores, Cluster = km$cluster, Species = df$Species)

# stat_ellipse was not part of base ggplot2 when this was written;
# ggplot2 >= 1.0.0 ships stat_ellipse(), so the source() below is only
# needed on older versions
source("https://raw.github.com/low-decarie/FAAV/master/r/stat-ellipse.R")

ggplot(ggdata) +
  geom_point(aes(x=PC1, y=PC2, color=factor(Cluster)), size=5, shape=20) +
  stat_ellipse(aes(x=PC1,y=PC2,fill=factor(Cluster)),
               geom="polygon", level=0.95, alpha=0.2) +
  guides(color=guide_legend("Cluster"),fill=guide_legend("Cluster"))

Produces this: (scatter plot of PC1 vs PC2, points colored by cluster, with shaded 95% confidence ellipses around each cluster)

Comparing ggdata$Cluster with ggdata$Species shows that setosa maps perfectly to one cluster, while versicolor dominates a second and virginica dominates the third. However, there is substantial overlap between the versicolor and virginica clusters.
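That cluster-to-species mapping can be checked with a simple cross-tabulation. A minimal, self-contained sketch (the seed is an assumption for reproducibility; which cluster label lands on which species is arbitrary and varies with the k-means starts):

```r
set.seed(42)                                  # assumed seed, for reproducible k-means starts
pca <- prcomp(iris[, 1:4], scale. = TRUE)     # scaled PCA on iris
km  <- kmeans(pca$x[, 1:3], centers = 3, nstart = 5)

# Rows are k-means clusters, columns are species; a clean
# (relabeled) diagonal indicates good agreement
table(Cluster = km$cluster, Species = iris$Species)
```

A table with one dominant species per row reproduces the observation above: setosa separates cleanly, versicolor and virginica partially mix.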

Thanks to Etienne Low-Decarie for posting this very useful addition to ggplot on github.

answered Sep 25 '22 by jlhoward


You could use a PERMANOVA to partition the Euclidean distance matrix by your groups:

data(iris)
require(vegan)

# PCA
iris_c <- scale(iris[ ,1:4])
pca <- rda(iris_c)

# plot
plot(pca, type = 'n', display = 'sites')
cols <- c('red', 'blue', 'green')
points(pca, display='sites', col = cols[iris$Species], pch = 16)
ordihull(pca, groups=iris$Species)
ordispider(pca, groups = iris$Species, label = TRUE)

# PERMANOVA - partitioning the euclidean distance matrix by species
# (newer versions of vegan deprecate adonis() in favour of adonis2())
adonis(iris_c ~ Species, data = iris, method = 'eu')
answered Sep 24 '22 by EDi