If you search online there are a few threads that discuss the use of the covmat
flag in the function princomp
, which performs principal component analysis on its input. If the covmat
parameter is not defined, princomp
first computes the sample covariance matrix of the input.
On the other hand, there is little to no discussion about what a similar function prcomp
actually does to accomplish the task of principal component analysis on data, only discussions about whether or not it is more accurate than princomp
. This begs the question: does prcomp
take covariance matrices or sample data matrices as input? It's not clear from the help documentation, which states (in the non-formula context):
Default S3 method:
prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, ...)
x
- a numeric or complex matrix (or data frame) which provides the data for the principal components analysis.
The help file does not include any examples for this method, only an example for the one documented above it which works on formula objects. The documentation sort of implies that the expected input is a covariance matrix, like so:
The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using
eigen
on the covariance matrix.
However, it is unclear whether "data matrix" is a "covariance matrix", and whether or not "data matrix" means x
as given in the earlier part of the documentation.
princomp can call eigen on the correlation or covariance matrix. Its default calculation uses divisor N for the covariance matrix.
The prcomp function returns an object of class prcomp, which have some methods available. The print method returns the standard deviation of each of the four PCs, and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables.
The prcomp function takes in the data as input, and it is highly recommended to set the argument scale=TRUE. This standardize the input data so that it has zero mean and variance one before doing PCA. We have stored the results from prcomp and the resulting object has many useful variables associated with the analysis.
PCA works best on data set having 3 or higher dimensions. Because, with higher dimensions, it becomes increasingly difficult to make interpretations from the resultant cloud of data. PCA is applied on a data set with numeric variables. PCA is a tool which helps to produce better visualizations of high dimensional data.
The answer, luckily, can be found in the source codes of the two functions.
First, the source for prcomp
:
> stats:::prcomp.default
function (x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL,
...)
{
x <- as.matrix(x)
x <- scale(x, center = center, scale = scale.)
cen <- attr(x, "scaled:center")
sc <- attr(x, "scaled:scale")
if (any(sc == 0))
stop("cannot rescale a constant/zero column to unit variance")
s <- svd(x, nu = 0)
s$d <- s$d/sqrt(max(1, nrow(x) - 1))
if (!is.null(tol)) {
rank <- sum(s$d > (s$d[1L] * tol))
if (rank < ncol(x)) {
s$v <- s$v[, 1L:rank, drop = FALSE]
s$d <- s$d[1L:rank]
}
}
dimnames(s$v) <- list(colnames(x), paste0("PC", seq_len(ncol(s$v))))
r <- list(sdev = s$d, rotation = s$v, center = if (is.null(cen)) FALSE else cen,
scale = if (is.null(sc)) FALSE else sc)
if (retx)
r$x <- x %*% s$v
class(r) <- "prcomp"
r
}
Notice that no covariance calculations are being performed in the upper block. The scaling and centering operations are performed on the input as provided, at which point the singular value decomposition (SVD) function is called on the result. The next step is to check the size of the result against the rank of the resulting diagonalization to ensure the result is valid. Lastly, the output is formatted and set to the appropriate class.
In other words, prcomp
is a nice improvement to simply calling SVD on covariance matrices, but will not compute covariance matrices for you. prcomp
is not called on data, it's called on some provided estimate of covariance of some data.
edit: The struck-out sentence is wrong! There is no need to form the covariance matrix in this case, which I would have realized had I had my math hat on properly! For an explanation why, see this math.SO thread. Calculating the principal components with SVD on the data matrix is definitely more efficient here.
Compare to the code from princomp
(only a portion shown):
if (is.list(covmat)) {
if (any(is.na(match(c("cov", "n.obs"), names(covmat)))))
stop("'covmat' is not a valid covariance list")
cv <- covmat$cov
n.obs <- covmat$n.obs
cen <- covmat$center
}
else if (is.matrix(covmat)) {
if (!missing(x))
warning("both 'x' and 'covmat' were supplied: 'x' will be ignored")
cv <- covmat
n.obs <- NA
cen <- NULL
}
else if (is.null(covmat)) {
dn <- dim(z)
if (dn[1L] < dn[2L])
stop("'princomp' can only be used with more units than variables")
covmat <- cov.wt(z)
n.obs <- covmat$n.obs
cv <- covmat$cov * (1 - 1/n.obs)
cen <- covmat$center
}
As you can see, the princomp
function does a lot more depending on how the input is passed, which requires more care.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With