 

Test significance of clusters on a PCA plot

Tags: r, statistics, pca

Is it possible to test the significance of the clustering between two known groups on a PCA plot? For example, to test how close the clusters are, the amount of spread (variance) within each, and the amount of overlap between them.

asked Nov 28 '13 by rmf




2 Answers

Here is a qualitative method that uses ggplot(...) to draw 95% confidence ellipses around the clusters. Note that stat_ellipse(...) uses the bivariate t-distribution by default.

library(ggplot2)

df     <- data.frame(iris)                        # iris dataset
pca    <- prcomp(df[, 1:4], retx = TRUE, scale. = TRUE)  # scaled PCA (exclude species col)
scores <- pca$x[, 1:3]                            # scores for first three PCs

# k-means clustering (assume 3 clusters)
km     <- kmeans(scores, centers = 3, nstart = 5)
ggdata <- data.frame(scores, Cluster = km$cluster, Species = df$Species)

# stat_ellipse was not part of base ggplot2 when this was written;
# ggplot2 >= 1.0.0 ships stat_ellipse(), so the source() below is only
# needed on older versions
source("https://raw.github.com/low-decarie/FAAV/master/r/stat-ellipse.R")

ggplot(ggdata) +
  geom_point(aes(x=PC1, y=PC2, color=factor(Cluster)), size=5, shape=20) +
  stat_ellipse(aes(x=PC1,y=PC2,fill=factor(Cluster)),
               geom="polygon", level=0.95, alpha=0.2) +
  guides(color=guide_legend("Cluster"),fill=guide_legend("Cluster"))

Produces this: (scatter plot of PC1 vs PC2, points colored by cluster, with shaded 95% confidence ellipses around each cluster)

Comparing ggdata$Cluster with ggdata$Species shows that setosa maps perfectly to one cluster, while versicolor dominates a second and virginica dominates the third. However, there is substantial overlap between the versicolor and virginica clusters.
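That cluster-to-species mapping can be checked with a simple cross-tabulation. A minimal, self-contained sketch (the seed is an assumption for reproducibility; which cluster label lands on which species is arbitrary and varies with the k-means starts):

```r
set.seed(42)                                  # assumed seed, for reproducible k-means starts
pca <- prcomp(iris[, 1:4], scale. = TRUE)     # scaled PCA on iris
km  <- kmeans(pca$x[, 1:3], centers = 3, nstart = 5)

# Rows are k-means clusters, columns are species; a clean
# (relabeled) diagonal indicates good agreement
table(Cluster = km$cluster, Species = iris$Species)
```

A table with one dominant species per row reproduces the observation above: setosa separates cleanly, versicolor and virginica partially mix.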

Thanks to Etienne Low-Decarie for posting this very useful addition to ggplot on github.

answered Sep 25 '22 by jlhoward


You could use a PERMANOVA to partition the Euclidean distance matrix by your groups:

data(iris)
require(vegan)

# PCA
iris_c <- scale(iris[ ,1:4])
pca <- rda(iris_c)

# plot
plot(pca, type = 'n', display = 'sites')
cols <- c('red', 'blue', 'green')
points(pca, display='sites', col = cols[iris$Species], pch = 16)
ordihull(pca, groups=iris$Species)
ordispider(pca, groups = iris$Species, label = TRUE)

# PERMANOVA - partitioning the euclidean distance matrix by species
# (newer versions of vegan deprecate adonis() in favour of adonis2())
adonis(iris_c ~ Species, data = iris, method = 'eu')
answered Sep 24 '22 by EDi