Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PCA from FactoMineR::PCA() not returning same as base::prcomp()

Tags:

r

pca

Doing some PCA analysis and when comparing with the results for FactoMineR function PCA together with prcomp from base I dont get the same results. One example

library(ISLR)
library(FactoMineR)
data("NCI60")

df <- NCI60$data


pca_prcomp <- prcomp(df, scale. = T)
pca_facto <- FactoMineR::PCA(df, scale.unit = T, graph = F, ncp = 65)


# One column is missing

dim(pca_prcomp$x)
dim(pca_facto$ind$coord) 

# Values are similiare - but not the same

head(pca_prcomp$x[, 1:2])
head(pca_facto$ind$coord[, 1:2])


# Using scale function - does not return same values

pca_facto_scale <- PCA(scale(df), scale.unit = F, graph = F, ncp = 65)

head(pca_facto$ind$coord[, 1:2], 3)
head(pca_facto_scale$ind$coord[, 1:2], 3)
like image 592
MLEN Avatar asked Nov 16 '25 23:11

MLEN


1 Answers

Sorry for being late, the FactoMineR package uses the same approach of svd() which should be similar (but not identical) with the prcomp() approach and both of them are listed under the Q-mode, which is the preferred method to do PCA for its numerical accuracy. But note, I didn't say identical, why? FactoMineR uses its own algorithm for PCA where it calculates the number of components like the following:

ncp <- min(ncp, nrow(X) - 1, ncol(X))

which tells you clearly why you got number of components 63 not 64 as what prcomp() would normally give. Your data set is typical of genomics data where you have n rows smaller than p columns of genes and the above code will clearly take columns or rows, whichever has the less number. If you follow the svd() algorithm it will return 64 dimensions not 63.

To explore the source code further type FactoMineR:::PCA.

For differences between the Q-mode (svd, prcomp(), FactoMineR::PCA()) and R-mode (eigen(), princomp()) I would recommend visiting this answer.

Side note: for prcomp() you want pass the center = T argument in order to center your data before doing PCA. Scaling on the other hand will give all your gene columns equal weight.

pca_prcomp <- prcomp(df, center = T, scale. = T) # add center=T

For scaling, the prcomp() use N as the divisor while FactoMineR::PCA() uses N-1 instead. The code below will prove it (refer to the same linked answer above):

# this is the scaled data by scale()
df_scaled <- scale(df)

# then you need to get the standardized data matrix from the output of the FactoMineR::PCR() function, which can be done easily as follows:
df_restored <- pca_facto$svd$U %*% diag(pca_facto$svd$vs) %*% t(pca_facto$svd$V)

# the to make both FactoMineR::PCR() and scale() match up you need to do the correction
df_corrected <- df_restored * sqrt(63 / 64) # correct for sqrt(N-1/N)

head(df[, 1:5]) # glimpse the first five columns only!
head(df_scaled[, 1:5])
head(df_restored[, 1:5]) # glimpse the first five columns only!
head(df_corrected[, 1:5])
round(head(df_scaled[, 1:5]), 3) == round(head(df_corrected[, 1:5]), 3) # TRUE

R> head(df[, 1:5])
       1      2      3      4      5
V1 0.300  1.180  0.550  1.140 -0.265
V2 0.680  1.290  0.170  0.380  0.465
V3 0.940 -0.040 -0.170 -0.040 -0.605
V4 0.280 -0.310  0.680 -0.810  0.625
V5 0.485 -0.465  0.395  0.905  0.200
V6 0.310 -0.030 -0.100 -0.460 -0.205
R> head(df_scaled[, 1:5])
       1        2      3      4      5
V1 0.723  1.59461  1.315  1.345 -0.600
V2 1.584  1.73979  0.438  0.649  0.905
V3 2.173 -0.01609 -0.346  0.264 -1.301
V4 0.678 -0.37256  1.615 -0.441  1.235
V5 1.142 -0.57720  0.958  1.130  0.359
V6 0.746 -0.00289 -0.185 -0.120 -0.476
R> head(df_restored[, 1:5])
      [,1]     [,2]   [,3]   [,4]   [,5]
[1,] 0.729  1.60722  1.326  1.356 -0.605
[2,] 1.596  1.75354  0.442  0.654  0.912
[3,] 2.190 -0.01622 -0.349  0.266 -1.311
[4,] 0.683 -0.37550  1.628 -0.444  1.244
[5,] 1.151 -0.58176  0.965  1.139  0.361
[6,] 0.752 -0.00291 -0.186 -0.121 -0.480
R> head(df_corrected[, 1:5])
      [,1]     [,2]   [,3]   [,4]   [,5]
[1,] 0.723  1.59461  1.315  1.345 -0.600
[2,] 1.584  1.73979  0.438  0.649  0.905
[3,] 2.173 -0.01609 -0.346  0.264 -1.301
[4,] 0.678 -0.37256  1.615 -0.441  1.235
[5,] 1.142 -0.57720  0.958  1.130  0.359
[6,] 0.746 -0.00289 -0.185 -0.120 -0.476
R> round(head(df_scaled[, 1:5]), 3) == round(head(df_corrected[, 1:5]), 3)
      1    2    3    4    5
V1 TRUE TRUE TRUE TRUE TRUE
V2 TRUE TRUE TRUE TRUE TRUE
V3 TRUE TRUE TRUE TRUE TRUE
V4 TRUE TRUE TRUE TRUE TRUE
V5 TRUE TRUE TRUE TRUE TRUE
V6 TRUE TRUE TRUE TRUE TRUE

Book excerpt

There is also the book for FactoMineR package called "Exploratory Multivariate Analysis by Example Using R" 2nd edition by François Husson, Sébastien, and Lê Jérôme Pagès. Below is an excerpt from page 55 of the book which was discussing a data set from a genomic study similar to yours with n rows (43) far less than p 7407 columns chicken.csv data set, you can see more info in their website as well as the data set itself can be downloaded from this link.

enter image description here

like image 58
doctorate Avatar answered Nov 18 '25 14:11

doctorate



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!