Is it possible to test the significance of clustering between 2 known groups on a PCA plot? To test how close they are or the amount of spread (variance) and the amount of overlap between clusters etc.
We can take the output of a clustering method, that is, take the clustering memberships of individuals, and use that information in a PCA plot. The location of the individuals on the first factorial plane, taking into consideration their clustering assignment, gives an excellent opportunity to “see in depth” the information contained in data.
Implementation of Principal Component Analysis (PCA) in K Means Clustering A beginner’s approach to apply PCA using 2 components to a K Means clustering algorithm using Python and its libraries.
In turn, the average characteristics of a group serve us to characterize all individuals in the corresponding cluster. We can take the output of a clustering method, that is, take the clustering memberships of individuals, and use that information in a PCA plot.
A scree plot displays how much variation each principal component captures from the data A scree plot, on the other hand, is a diagnostic tool to check whether PCA works well on your data or not. Principal components are created in order of the amount of variation they cover: PC1 captures the most variation, PC2 — the second most, and so on.
Here is a qualitative method that uses ggplot(...)
to draw 95% confidence ellipses around clusters. Note that stat_ellipse(...)
uses the bivariate t-distribution.
library(ggplot2)
df <- data.frame(iris) # iris dataset
pca <- prcomp(df[,1:4], retx=T, scale.=T) # scaled pca [exclude species col]
scores <- pca$x[,1:3] # scores for first three PC's
# k-means clustering [assume 3 clusters]
km <- kmeans(scores, centers=3, nstart=5)
ggdata <- data.frame(scores, Cluster=km$cluster, Species=df$Species)
# stat_ellipse is not part of the base ggplot package
source("https://raw.github.com/low-decarie/FAAV/master/r/stat-ellipse.R")
ggplot(ggdata) +
geom_point(aes(x=PC1, y=PC2, color=factor(Cluster)), size=5, shape=20) +
stat_ellipse(aes(x=PC1,y=PC2,fill=factor(Cluster)),
geom="polygon", level=0.95, alpha=0.2) +
guides(color=guide_legend("Cluster"),fill=guide_legend("Cluster"))
Produces this:
Comparison of ggdata$Clusters
and ggdata$Species
shows that setosa maps perfectly to cluster 1, while versicolor dominates cluster 2 and virginica dominates cluster 3. However, there is significant overlap between clusters 2 and 3.
Thanks to Etienne Low-Decarie for posting this very useful addition to ggplot
on github.
You could use a PERMANOVA to partition the euclidean distance by your groups:
data(iris)
require(vegan)
# PCA
iris_c <- scale(iris[ ,1:4])
pca <- rda(iris_c)
# plot
plot(pca, type = 'n', display = 'sites')
cols <- c('red', 'blue', 'green')
points(pca, display='sites', col = cols[iris$Species], pch = 16)
ordihull(pca, groups=iris$Species)
ordispider(pca, groups = iris$Species, label = TRUE)
# PerMANOVA - partitioning the euclidean distance matrix by species
adonis(iris_c ~ Species, data = iris, method='eu')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With