
Principal component analysis (PCA) in R: which function to use?

Can anyone explain what the major differences between the prcomp and princomp functions are?

Is there any particular reason why I should choose one over the other? In case this is relevant, the type of application I am looking at is a quality control analysis for genomic (expression) data sets.

Thank you!

asked Jan 10 '13 at 00:01 by AndraD


1 Answer

There are differences between these two functions w/r/t

  • the function parameters (what you can/must pass in when you call the function);
  • the values returned by each; and
  • the numerical technique used by each to calculate principal components.


Numerical Technique Used to Calculate PCA

In particular, princomp should generally be faster (and the performance difference should grow with the number of rows in the data matrix), given that it calculates principal components via eigendecomposition of the covariance matrix, whereas prcomp calculates them via singular value decomposition (SVD) of the centered data matrix.

Eigenvalue decomp is only defined for square matrices (because the technique is just solving the characteristic polynomial), but that's not a practical limitation, because the eigenvalue decomp always involves the preliminary step of calculating the covariance matrix from the original data matrix.

Not only is the covariance matrix square, but it is usually much smaller than the original data matrix (as long as the number of attributes, m, is less than the number of rows, n, i.e., m < n, which is true most of the time).

The former (eigenvector decomp) is less accurate (the difference is often not material), but faster because the decomposition is performed on the covariance matrix rather than on the original data matrix. For instance, if the data matrix has the usual shape such that n >> m, e.g., 1000 rows and 10 columns, then the covariance matrix is 10 x 10; by contrast, prcomp calculates the SVD on the full 1000 x 10 matrix.
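To make the two routes concrete, here is a minimal sketch on made-up data (the matrix X and the seed are my own assumptions, not anything from the question). It shows that eigendecomposition of the covariance matrix and SVD of the centered data matrix arrive at the same component standard deviations, which also match what prcomp reports.

```r
# Toy data: 1000 rows, 10 columns (assumed for illustration only)
set.seed(1)
X <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)

# Route 1: eigendecomposition of the 10 x 10 covariance matrix
eig <- eigen(cov(X), symmetric = TRUE)
sdev_eigen <- sqrt(eig$values)

# Route 2: SVD of the centered 1000 x 10 data matrix
Xc <- scale(X, center = TRUE, scale = FALSE)
sv <- svd(Xc)
sdev_svd <- sv$d / sqrt(nrow(X) - 1)   # singular values -> standard deviations

# Both routes should agree up to floating-point error
same_sdev <- isTRUE(all.equal(sdev_eigen, sdev_svd))

# prcomp reports the same values in its sdev component
same_as_prcomp <- isTRUE(all.equal(sdev_svd, prcomp(X)$sdev))
```

Note that princomp divides by n rather than n - 1 when it forms the covariance matrix, so its sdev values differ from these by a factor of sqrt((n - 1) / n).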

I don't know the shape of data matrices for genomic expression data, but if the rows are in the thousands or even hundreds, then prcomp may be noticeably slower than princomp. One thing to check first: princomp refuses to run when there are fewer observations (rows) than variables (columns), so if your samples are rows and your genes are columns, prcomp may be your only option of the two. I don't know your context, e.g., whether PCA is performed as a single step in a larger data flow and whether net performance (execution speed) is of concern, so I can't say whether this performance difference is relevant for your use case. Likewise, it's difficult to say whether the difference in numerical accuracy between the two techniques is significant; that depends on the data.

Return Values

princomp returns a list comprised of seven items; prcomp returns a list of five.

> names(pc1)    # prcomp
    [1] "sdev"     "rotation" "center"   "scale"    "x"       

> names(pc2)    # princomp
    [1] "sdev"     "loadings" "center"   "scale"    "n.obs"    "scores"   "call"    

For princomp, the most important items returned are the component scores and loadings.

The values returned by the two functions can be reconciled (compared) this way: prcomp returns, among other things, a matrix called rotation, which is equivalent to the loadings matrix returned by princomp (up to the sign of each column).

If you multiply the centered (and, if requested, scaled) data matrix by prcomp's rotation matrix, the result is the matrix stored as x, which corresponds to the scores returned by princomp.
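The reconciliation can be checked directly. This is only a sketch on simulated data (dat, its column names and the seed are assumptions for illustration); signs of individual components may differ between the two functions, so the comparison is made on absolute values.

```r
# Simulated data: 200 rows, 4 named columns (assumed for illustration only)
set.seed(42)
dat <- matrix(rnorm(200 * 4), nrow = 200, ncol = 4,
              dimnames = list(NULL, paste0("v", 1:4)))

pc1 <- prcomp(dat)     # SVD route
pc2 <- princomp(dat)   # eigendecomposition route

# rotation (prcomp) vs loadings (princomp): same columns up to sign
rot <- unclass(pc1$rotation)
lod <- unclass(pc2$loadings)
max_diff_loadings <- max(abs(abs(rot) - abs(lod)))

# x (prcomp) is the centered data multiplied by the rotation matrix
x_manual <- scale(dat, center = TRUE, scale = FALSE) %*% pc1$rotation
max_diff_x <- max(abs(x_manual - pc1$x))

# x (prcomp) vs scores (princomp): same columns up to sign
max_diff_scores <- max(abs(abs(pc1$x) - abs(pc2$scores)))
```

All three differences should be at the level of floating-point noise.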

Finally, prcomp has a plot method which gives a scree plot (it shows the variance accounted for by each principal component, which is the most useful visualization of PCA in my opinion); the cumulative proportion of variance is available from summary.
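A short sketch of the scree plot and the numbers behind it. The built-in USArrests data set is my own choice for illustration and has nothing to do with the question's expression data.

```r
# Built-in example data (an assumption for illustration), scaled to unit variance
pc <- prcomp(USArrests, scale. = TRUE)

# Scree plot of component variances; plot(pc) draws the same thing.
# Guarded so the script also runs where no graphics device is available.
if (interactive()) screeplot(pc, type = "lines")

# The numbers behind the plot: share of total variance per component
var_share <- pc$sdev^2 / sum(pc$sdev^2)

# summary() reports the same shares plus the cumulative proportion
imp <- summary(pc)$importance
cum_share <- imp["Cumulative Proportion", ]
```

Reading cum_share from left to right is one way to decide how many components to keep.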

Function Arguments

prcomp will mean center your data and scale it to unit variance for you if you set to TRUE the arguments center (TRUE by default) and scale. (FALSE by default; note the trailing dot). That's a trivial difference between the two, given that you can both scale and mean center your data in a single line using the scale function.
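As a rough check of that equivalence, the sketch below again uses the built-in USArrests data purely as an assumed example. It also shows princomp's cor = TRUE argument, which is its way of working on standardized variables.

```r
# Let prcomp center and scale internally
pc_a <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Center and scale beforehand with scale(), then run prcomp with defaults
pc_b <- prcomp(scale(USArrests))

# Both approaches should give the same component standard deviations
same_prcomp <- isTRUE(all.equal(pc_a$sdev, pc_b$sdev))

# princomp on the correlation matrix is the counterpart of scale. = TRUE
pc_c <- princomp(USArrests, cor = TRUE)
same_princomp <- isTRUE(all.equal(pc_a$sdev, unname(pc_c$sdev)))
```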

answered Sep 30 '22 at 20:09 by doug
