Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to conduct PCA on each group for a dataset with multiple groups?

Tags:

r

pca

princomp

I have a dataset of individuals from four populations, four treatments and three replicates. Each individual is in only one population, treatment and replicate combination. I have taken four measurements from each individual. I would like to conduct a PCA on these measurement for each population, substrate, and replicate combination.

I am aware of how to do a PCA on all individuals and I can split the dataset into multiple datasets for each combination of population, substrate, and replicate and then perform the PCA on each new dataset.

How can I conduct a PCA on the complete dataset getting separate PC1, PC2... results for each combination of population, substrate, and replicate the most efficiently? I have a thought about converting the dataset to a list, but unsure how to apply the princomp function to a list. Am I on the right track?

Sample data:

TestData<- structure(list(Location = c("A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
                                   "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
                                   "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",
                                   "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D"),
              Substrate = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D",
                            "A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D",
                            "A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D",
                            "A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D"),
              Replicate = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
                            1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
                            1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
                            1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), 
              Adult_Weight = c(0.0092, 0.0083, 0.0088, 0.0077, 0.0088, 0.01, 
                               0.0099, 0.011, 0.0078, 0.0086, 0.0071, 0.0093, 
                               0.0111, 0.01, 0.0097, 0.0091, 0.0083, 0.0098,
                               0.0093, 0.009, 0.0114, 0.0087, 0.0094, 0.0096, 
                               0.0099, 0.0105, 0.0091, 0.0115, 0.0106, 0.0104, 
                               0.0113, 0.0115, 0.0107, 0.0126, 0.0106, 0.0101,
                               0.0095, 0.0113, 0.0111, 0.0118, 0.0114, 0.0123, 
                               0.0119, 0.0103, 0.0119, 0.0116, 0.0112, 0.0114), 
              Adult_Thorax_Width = c(1.31, 1.31, 1.43, 1.45, 1.52, 1.43, 1.57, 1.45, 1.43, 1.54, 1.32, 1.49, 
                                     1.58, 1.36, 1.42, 1.45, 1.48, 1.38, 1.55, 1.46, 1.52, 1.42, 1.6, 1.49, 
                                     1.48, 1.58, 1.51, 1.53, 1.54, 1.76, 1.63, 1.62, 1.44, 1.51, 1.53, 1.58, 
                                     1.46, 1.94, 1.54, 2.09, 1.5, 1.65, 1.86, 1.54, 1.8, 1.98, 1.82, 1.63), 
              Adult_Wing_Length = c(1359L, 1377L, 1555L, 1559L, 1562L, 1578L, 1580L, 1588L, 1597L, 1598L, 1603L, 1605L, 
                                    1612L, 1614L, 1616L, 1617L, 1623L, 1628L, 1639L, 1642L, 1643L, 1649L, 1651L, 1652L, 
                                    1653L, 1653L, 1654L, 1656L, 1656L, 1656L, 1662L, 1664L, 1665L, 1668L, 1670L, 1670L, 
                                    1671L, 1672L, 1674L, 1682L, 1685L, 1687L, 1688L, 1694L, 1698L, 1698L, 1707L, 1708L), 
              Adult_Leg_Length = c(414L, 390L, 627L, 541L, 430L, 450L, 451L, 462L, 443L, 582L, 435L, 579L, 
                                   499L, 418L, 444L, 646L, 589L, 466L, 435L, 477L, 450L, 606L, 660L, 450L, 
                                   446L, 480L, 462L, 438L, 483L, 454L, 492L, 457L, 463L, 499L, 470L, 474L, 
                                   627L, 478L, 473L, 496L, 666L, 499L, 480L, 461L, 450L, 483L, 460L, 584L)),
              .Names = c("Location", "Substrate", "Replicate", "Weight", "Thorax_Width", "Wing_Length", "Leg_Length"),
              row.names = c(NA, 48L), 
              class = "data.frame")
like image 677
Keith W. Larson Avatar asked Oct 10 '14 10:10

Keith W. Larson


1 Answers

You should input your population and treatments as factorial variables, and have the three replicates as separate rows, if I understood your data composition correctly. Column types would be something like:

  • 1st column population: factor
  • 2nd column treatment: factor
  • 3rd - 6th column measurement: numeric (total 4 columns)

And the overall data class should be preferably 'data.frame', because in 'data.frame' your columns may have different class types (unlike in 'matrix' for example).

Here is an example that stratifies the example Iris-dataset according to a factorial variable, here 'iris$Species'. If you have multiple factors that you want to stratify for, you could use a two (or more) columned matrix as an input for the INDICES argument. Are you sure you don't really mean a single PCA with annotations though? This could be easily be done by changing your factor-type variables to numbers and annotate them in the scatterplot e.g. through 'col' (=color) and 'pch' (=symbol) parameters.

data(iris) # Load the example Iris-dataset
class(iris)
lapply(iris, FUN=class)
#> class(iris)
#[1] "data.frame"
#> 
#> lapply(iris, FUN=class)
#$Sepal.Length
#[1] "numeric"
#
#$Sepal.Width
#[1] "numeric"
#
#$Petal.Length
#[1] "numeric"
#
#$Petal.Width
#[1] "numeric"
#
#$Species
#[1] "factor"

par(mfrow=c(2,2), mar=c(4,4,2,1))
# Separate PCA plot for each Species
# Apply our defined PCA-function where each unique INDICES are handled as a separate function call
by(iris, INDICES=iris$Species, FUN=function(z){
    # Use numeric fields for the PCA
    pca <- prcomp(z[,unlist(lapply(z, FUN=class))=="numeric"])
    plot(pca$x[,1:2], pch=16, main=z[1,"Species"]) # 2 first principal components
    z
})

# Color annotation
# Use numeric fields for the PCA
pca <- prcomp(iris[,unlist(lapply(iris, FUN=class))=="numeric"])
plot(pca$x[,1:2], pch=16, col=as.numeric(iris[,"Species"]), main="Color annotation") # 2 first principal components
legend("bottom", pch=16, col=unique(as.numeric(iris[,"Species"])), legend=unique(iris[,"Species"]))

PCA example

Notice that the PCA-axes are not the same in the first three panels counting from top-left. This is due to the fact that the covariance matrix in the PCA computation is not the same when computing only group-wise PCAs.

Alternatively, if you want a single PCA, but just plot the observations belonging to different categories in their own windows, you could try something in the lines of:

par(mfrow=c(1,3))
# Compute the PCA
pca <- prcomp(iris[,unlist(lapply(iris, FUN=class))=="numeric"])
# Apply a plotting function over unique values of iris$Species, notice we always plot the same 'pca' object in all categories
lapply(unique(iris$Species), FUN=function(z) { 
    plot(pca$x[which(z==iris$Species),1:2], xlim=extendrange(pca$x[,1]), ylim=extendrange(pca$x[,2]),pch=16, main=z)
})

pca2

EDIT:

Out of the help file of the 'by'-function: "INDICES: a factor or a list of factors, each of length nrow(data)."

Thus, we can stratify the data in respect to multiple factorial variables if we provide the indices within a list to the by-function. Here is an artificial example, where 'first' and 'second' are two simultaneous factors that stratify the data. This should be trivial to extend to three (or more) variables:

ex <- cbind(matrix(rnorm(400), ncol=4), first = c("A", "B"), second = c("foo", "bar", "asd", "fgh", "jkl"))
by(ex, INDICES=list(ex[,"first"], ex[,"second"]), FUN=function(z) z)
# Modify the above function provided in FUN to suit your needs
like image 147
Teemu Daniel Laajala Avatar answered Sep 28 '22 19:09

Teemu Daniel Laajala