Can anyone explain what the major differences between the prcomp and princomp functions are?
Is there any particular reason why I should choose one over the other? In case this is relevant, the type of application I am looking at is a quality control analysis for genomic (expression) data sets.
Thank you!
Principal component analysis (PCA) is a technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance.
Choosing the principal components: a common way of selecting the principal components to keep is to set a threshold of explained variance, such as 80%, and then take the number of components whose cumulative explained variance comes as close as possible to that threshold.
The most important use of PCA is to represent a multivariate data table as a smaller set of variables (summary indices) in order to observe trends, jumps, clusters and outliers. This overview may uncover the relationships between observations and variables, and among the variables.
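The threshold rule can be sketched in a few lines of R. This is only an illustration: USArrests is a built-in dataset chosen for convenience, the 80% cutoff is an assumed value, and the sketch takes the first component count that reaches the threshold rather than the one numerically closest to it.

```r
# Illustrative data only: USArrests ships with R.
pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
cum_var <- cumsum(var_explained)

threshold <- 0.80  # assumed cutoff, adjust for your data

# Smallest number of components whose cumulative variance reaches the threshold
k <- which(cum_var >= threshold)[1]

print(round(cum_var, 3))
print(k)
```

The same cumulative proportions are also reported by summary(pc), so the manual calculation is mainly useful when you want the value of k programmatically.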
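There are differences between these two functions with respect to performance and numerical accuracy, return values, and how the data are centered and scaled.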
In particular, princomp should be a lot faster (and the performance difference will increase with the size of the data matrix) given that it calculates principal components via eigenvector decomposition on the covariance matrix, versus prcomp, which calculates principal components via singular value decomposition (SVD) on the original data matrix.
Eigenvalue decomp is only defined for square matrices (because the technique amounts to solving the characteristic polynomial), but that's not a practical limitation because the eigenvalue decomp is always preceded by the step of calculating the covariance matrix from the original data matrix.
Not only is the covariance matrix square, but it is usually much smaller than the original data matrix (as long as the number of attributes m is less than the number of rows n, i.e. m < n, which is true most of the time).
The former (eigenvector decomp) is less accurate (the difference is often not material), but much faster because computation is performed on the covariance matrix rather than on the original data matrix; so for instance, if the data matrix has the usual shape such that n >> m, e.g., 1000 rows and 10 columns, then the covariance matrix is 10 x 10; by contrast, prcomp calculates the SVD on the original 1000 x 10 matrix.
I don't know the shape of data matrices for genomic expression data, but if the rows are in the thousands or even hundreds, then prcomp should be noticeably slower than princomp. I don't know your context, e.g., whether PCA is performed as a single step in a larger data flow and whether net performance (execution speed) is of concern, so I can't say whether this performance difference is indeed relevant for your use case. Likewise, it's difficult to say whether the difference in numerical accuracy between the two techniques is significant; in fact, it depends on the data.
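To make the two routes concrete, here is a minimal sketch on synthetic data (the 1000 x 10 shape is taken from the example above; the random matrix and variable names are my own). It is not the internal code of either function, just the same two decompositions done by hand: eigen() on the small covariance matrix and svd() on the full centered data matrix.

```r
set.seed(1)
X <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)  # synthetic data
Xc <- scale(X, center = TRUE, scale = FALSE)           # mean-center the columns

# Route 1: eigendecomposition of the 10 x 10 covariance matrix
eig <- eigen(cov(Xc), symmetric = TRUE)
sdev_eigen <- sqrt(eig$values)

# Route 2: SVD of the centered 1000 x 10 data matrix
sv <- svd(Xc)
sdev_svd <- sv$d / sqrt(nrow(Xc) - 1)

# Component standard deviations agree to numerical precision
max_sdev_diff <- max(abs(sdev_eigen - sdev_svd))

# Component directions agree up to the sign of each column
max_dir_diff <- max(abs(abs(eig$vectors) - abs(sv$v)))

print(max_sdev_diff)
print(max_dir_diff)
```

On well-conditioned data like this the two routes should be practically indistinguishable; any accuracy gap would be expected to show up mainly on ill-conditioned inputs.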
princomp returns a list comprised of seven items; prcomp returns a list of five.
> names(pc1) # prcomp
[1] "sdev" "rotation" "center" "scale" "x"
> names(pc2) # princomp
[1] "sdev" "loadings" "center" "scale" "n.obs" "scores" "call"
For princomp, the most important items returned are the component scores and loadings.
The values returned by the two functions can be reconciled (compared) this way: prcomp returns, among other things, a matrix called rotation which is equivalent to the loadings matrix returned by princomp.
If you multiply the centered (and, if requested, scaled) data matrix by prcomp's rotation matrix, the result is what prcomp stores in the matrix keyed to x; these are the component scores, which princomp returns as scores.
Finally, prcomp has a plot method which gives a scree plot (it shows the variance of each principal component, which is the most useful visualization of PCA in my opinion).
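A quick way to check this reconciliation yourself is sketched below, using the built-in USArrests data purely as an example (pc1 and pc2 follow the naming used above). Two details to be aware of: the columns can differ in sign between the two functions, and the scores come from the mean-centered data times the rotation matrix. The sdev values also differ slightly because princomp divides by n and prcomp by n - 1.

```r
X <- as.matrix(USArrests)  # illustrative data only
pc1 <- prcomp(X)           # SVD on the centered data matrix
pc2 <- princomp(X)         # eigendecomposition of the covariance matrix

# rotation (prcomp) vs loadings (princomp): equal up to the sign of each column
load_diff <- max(abs(abs(unclass(pc2$loadings)) - abs(pc1$rotation)))

# prcomp's x is the centered data multiplied by the rotation matrix
Xc <- sweep(X, 2, pc1$center)
x_diff <- max(abs(Xc %*% pc1$rotation - pc1$x))

# sdev differs only by the divisor: n for princomp, n - 1 for prcomp
n <- nrow(X)
sdev_diff <- max(abs(pc2$sdev - pc1$sdev * sqrt((n - 1) / n)))

print(c(load_diff, x_diff, sdev_diff))
```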
prcomp will scale (to unit variance) and mean center your data for you if you set the arguments center and scale. (note the trailing dot in the argument name) to TRUE. That's a trivial difference between the two given that you can both scale and mean center your data in a single line using the scale function.
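As a small sketch of that equivalence (again with USArrests as stand-in data), the built-in arguments and a manual call to scale() should give the same component standard deviations. For comparison, princomp has no scale. argument; the closest counterpart is cor = TRUE, which works from the correlation matrix.

```r
X <- as.matrix(USArrests)  # illustrative data only

# Let prcomp center and scale the data itself
pc_args <- prcomp(X, center = TRUE, scale. = TRUE)

# Center and scale by hand with scale(), then switch both options off
pc_manual <- prcomp(scale(X), center = FALSE, scale. = FALSE)

# princomp works on the correlation matrix when cor = TRUE
pc_cor <- princomp(X, cor = TRUE)

manual_diff <- max(abs(pc_args$sdev - pc_manual$sdev))
cor_diff <- max(abs(pc_args$sdev - unname(pc_cor$sdev)))

print(c(manual_diff, cor_diff))
```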