principal component analysis (PCA) in R: which function to use?

Can anyone explain what the major differences between the prcomp and princomp functions are?

Is there any particular reason why I should choose one over the other? In case this is relevant, the type of application I am looking at is a quality control analysis for genomic (expression) data sets.

Thank you!

294

asked Jan 10 '13 00:01

AndraD

1 Answer

There are differences between these two functions with respect to:

the function parameters (what you can/must pass in when you call the function);
the values returned by each; and
the numerical technique used by each to calculate principal components.

Numerical Technique Used to Calculate PCA

In particular, princomp can be faster (and the performance difference grows with the number of rows in the data matrix) because it calculates principal components via eigendecomposition of the covariance matrix, whereas prcomp calculates them via singular value decomposition (SVD) of the centered (and optionally scaled) data matrix.

Eigendecomposition is only defined for square matrices (the technique amounts to solving the characteristic polynomial), but that's not a practical limitation because it is always preceded by the preliminary step of calculating the covariance matrix from the original data matrix.

Not only is the covariance matrix square, but it is usually much smaller than the original data matrix (as long as the number of attributes is less than the number of rows, or n < m, which is true most of the time).

The former (eigendecomposition) is less accurate (the difference is often not material), but can be much faster because the decomposition is performed on the covariance matrix rather than on the original data matrix; so for instance, if the data matrix has the usual shape such that m >> n, e.g., 1000 rows and 10 columns, then the covariance matrix is 10 x 10; by contrast, prcomp calculates the SVD on the original 1000 x 10 matrix.
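To make the two numerical routes concrete, here is a minimal sketch that computes the component standard deviations both ways and compares them with what the two functions return. The built-in USArrests data set and all variable names are my choices for illustration, not anything from the question.

```r
# Built-in data set: 50 rows (observations), 4 numeric columns (attributes)
X <- as.matrix(USArrests)
m <- nrow(X)

# Route 1 (the prcomp approach): SVD of the mean-centered data matrix
Xc <- scale(X, center = TRUE, scale = FALSE)
sv <- svd(Xc)
sdev_svd <- sv$d / sqrt(m - 1)

# Route 2 (the princomp approach): eigendecomposition of the covariance matrix
ev <- eigen(cov(X), symmetric = TRUE)
sdev_eig <- sqrt(ev$values)

# The two routes agree with each other and with prcomp
all.equal(sdev_svd, sdev_eig)
all.equal(sdev_svd, prcomp(X)$sdev)

# princomp uses divisor m instead of m - 1 for the covariance matrix,
# so its standard deviations are smaller by a factor of sqrt((m - 1) / m)
all.equal(unname(princomp(X)$sdev), sdev_eig * sqrt((m - 1) / m))
```

On a small, well-conditioned matrix like this one the two routes match to within all.equal's default tolerance; any accuracy gap would only show up on badly conditioned data.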

I don't know the shape of data matrices for genomic expression data, but if the rows are in the thousands or even hundreds and the columns are few, then prcomp may be noticeably slower than princomp. One caveat: princomp requires at least as many rows (observations) as columns (variables), so if your matrix has more genes as columns than samples as rows, only prcomp will work. I don't know your context, e.g., whether PCA is performed as a single step in a larger data flow and whether net performance (execution speed) is of concern, so I can't say whether this performance difference is indeed relevant for your use case. Likewise, it's difficult to say whether the difference in numerical accuracy between the two techniques is significant; it depends on the data.
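Since the shape of the matrix matters here, a quick sketch of what happens when there are more columns than rows may help. The matrix below is random noise standing in for expression data (6 hypothetical samples as rows, 50 hypothetical genes as columns); it is an assumption for illustration only.

```r
set.seed(1)

# Hypothetical stand-in for expression data: 6 samples (rows) x 50 genes (columns)
expr <- matrix(rnorm(6 * 50), nrow = 6, ncol = 50)

# prcomp works on a wide matrix; it returns at most min(nrow, ncol) components
pc <- prcomp(expr)
length(pc$sdev)
dim(pc$rotation)

# princomp refuses a matrix with fewer rows than columns, so capture the error
res <- tryCatch(princomp(expr), error = function(e) conditionMessage(e))
res
```

If your samples are in columns instead, transposing with t() before the call changes which of the two situations you are in.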

Return Values

princomp returns a list of seven items; prcomp returns a list of five.

> names(pc1)    # prcomp
    [1] "sdev"     "rotation" "center"   "scale"    "x"       

> names(pc2)    # princomp
    [1] "sdev"     "loadings" "center"   "scale"    "n.obs"    "scores"   "call"

For princomp, the most important items returned are the component scores and loadings.

The values returned by the two functions can be reconciled (compared) this way: prcomp returns, among other things, a matrix called rotation which is equivalent to the loadings matrix returned by princomp (the signs of individual columns may differ).

If you multiply the centered (and, if requested, scaled) data matrix by prcomp's rotation matrix, the result is the matrix stored in x, which corresponds to the scores returned by princomp.
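A minimal sketch of that reconciliation, again using the built-in USArrests data as an assumed example (pc1 and pc2 follow the naming in the output above; everything else is illustrative):

```r
X <- as.matrix(USArrests)
pc1 <- prcomp(X)     # SVD-based
pc2 <- princomp(X)   # eigendecomposition-based

# rotation (prcomp) and loadings (princomp) match up to the sign of each column
R1 <- unname(pc1$rotation)
L2 <- unname(unclass(pc2$loadings))
all.equal(abs(R1), abs(L2))

# x is the centered data matrix multiplied by the rotation matrix
Xc <- sweep(X, 2, pc1$center)
all.equal(unname(Xc %*% pc1$rotation), unname(pc1$x))

# x (prcomp) and scores (princomp) also match up to sign
all.equal(abs(unname(pc1$x)), abs(unname(pc2$scores)))
```

The abs() calls are there because each principal axis is only determined up to its direction, so the two functions may flip a column.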

Finally, prcomp has a plot method which gives a scree plot (it shows the variance captured by each principal component--the most useful visualization of PCA in my opinion); princomp objects have the same plot method.

Function Arguments

prcomp will mean center your data and scale it to unit variance for you if you set to TRUE the arguments center (TRUE by default) and scale. (note the trailing dot; FALSE by default). That's a trivial difference between the two given that you can both scale and mean center your data in a single line using the scale function.
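As a rough illustration of those arguments, the sketch below standardizes the data three ways and checks that the results agree. The data set and variable names are assumptions for the example; cor = TRUE is princomp's way of working on the correlation matrix instead of the covariance matrix.

```r
X <- as.matrix(USArrests)

# Option 1: let prcomp center and scale the columns itself
pc_a <- prcomp(X, center = TRUE, scale. = TRUE)

# Option 2: standardize first with scale(), then switch both arguments off
pc_b <- prcomp(scale(X), center = FALSE, scale. = FALSE)
all.equal(pc_a$sdev, pc_b$sdev)

# Option 3: princomp on the correlation matrix
pc_c <- princomp(X, cor = TRUE)
all.equal(unname(pc_c$sdev), pc_a$sdev)
```

With standardized columns the divisor difference between the two functions cancels out, which is why the princomp result lines up with prcomp here without any correction factor.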

181

answered Sep 30 '22 20:09

doug

Tags:

r

unsupervised-learning

linear-algebra

pca