Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R - does `prcomp` take sample data or covariance matrices as input?

Tags:

r

matrix

If you search online there are a few threads that discuss the use of the covmat flag in the function princomp, which performs principal component analysis on its input. If the covmat parameter is not defined, princomp first computes the sample covariance matrix of the input.

On the other hand, there is little to no discussion about what a similar function prcomp actually does to accomplish the task of principal component analysis on data, only discussions about whether or not it is more accurate than princomp. This begs the question: does prcomp take covariance matrices or sample data matrices as input? It's not clear from the help documentation, which states (in the non-formula context):

Default S3 method:

prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, ...)

x - a numeric or complex matrix (or data frame) which provides the data for the principal components analysis.

The help file does not include any examples for this method, only an example for the one documented above it which works on formula objects. The documentation sort of implies that the expected input is a covariance matrix, like so:

The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using eigen on the covariance matrix.

However, it is unclear whether "data matrix" is a "covariance matrix", and whether or not "data matrix" means x as given in the earlier part of the documentation.

like image 956
bright-star Avatar asked Nov 10 '13 23:11

bright-star


People also ask

Does Prcomp use correlation or covariance matrix?

princomp can call eigen on the correlation or covariance matrix. Its default calculation uses divisor N for the covariance matrix.

What values does Prcomp () function return?

The prcomp function returns an object of class prcomp, which have some methods available. The print method returns the standard deviation of each of the four PCs, and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables.

How does Prcomp in R work?

The prcomp function takes in the data as input, and it is highly recommended to set the argument scale=TRUE. This standardize the input data so that it has zero mean and variance one before doing PCA. We have stored the results from prcomp and the resulting object has many useful variables associated with the analysis.

On what kind of data you should use PCA to get the best results?

PCA works best on data set having 3 or higher dimensions. Because, with higher dimensions, it becomes increasingly difficult to make interpretations from the resultant cloud of data. PCA is applied on a data set with numeric variables. PCA is a tool which helps to produce better visualizations of high dimensional data.


1 Answers

The answer, luckily, can be found in the source codes of the two functions.

First, the source for prcomp:

> stats:::prcomp.default
function (x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, 
    ...) 
{
    x <- as.matrix(x)
    x <- scale(x, center = center, scale = scale.)
    cen <- attr(x, "scaled:center")
    sc <- attr(x, "scaled:scale")
    if (any(sc == 0)) 
        stop("cannot rescale a constant/zero column to unit variance")
    s <- svd(x, nu = 0)
    s$d <- s$d/sqrt(max(1, nrow(x) - 1))
    if (!is.null(tol)) {
        rank <- sum(s$d > (s$d[1L] * tol))
        if (rank < ncol(x)) {
            s$v <- s$v[, 1L:rank, drop = FALSE]
            s$d <- s$d[1L:rank]
        }
    }
    dimnames(s$v) <- list(colnames(x), paste0("PC", seq_len(ncol(s$v))))
    r <- list(sdev = s$d, rotation = s$v, center = if (is.null(cen)) FALSE else cen, 
        scale = if (is.null(sc)) FALSE else sc)
    if (retx) 
        r$x <- x %*% s$v
    class(r) <- "prcomp"
    r
}

Notice that no covariance calculations are being performed in the upper block. The scaling and centering operations are performed on the input as provided, at which point the singular value decomposition (SVD) function is called on the result. The next step is to check the size of the result against the rank of the resulting diagonalization to ensure the result is valid. Lastly, the output is formatted and set to the appropriate class.

In other words, prcomp is a nice improvement to simply calling SVD on covariance matrices, but will not compute covariance matrices for you. prcomp is not called on data, it's called on some provided estimate of covariance of some data.

edit: The struck-out sentence is wrong! There is no need to form the covariance matrix in this case, which I would have realized had I had my math hat on properly! For an explanation why, see this math.SO thread. Calculating the principal components with SVD on the data matrix is definitely more efficient here.

Compare to the code from princomp (only a portion shown):

if (is.list(covmat)) {
    if (any(is.na(match(c("cov", "n.obs"), names(covmat))))) 
        stop("'covmat' is not a valid covariance list")
    cv <- covmat$cov
    n.obs <- covmat$n.obs
    cen <- covmat$center
}
else if (is.matrix(covmat)) {
    if (!missing(x)) 
        warning("both 'x' and 'covmat' were supplied: 'x' will be ignored")
    cv <- covmat
    n.obs <- NA
    cen <- NULL
}
else if (is.null(covmat)) {
    dn <- dim(z)
    if (dn[1L] < dn[2L]) 
        stop("'princomp' can only be used with more units than variables")
    covmat <- cov.wt(z)
    n.obs <- covmat$n.obs
    cv <- covmat$cov * (1 - 1/n.obs)
    cen <- covmat$center
}

As you can see, the princomp function does a lot more depending on how the input is passed, which requires more care.

like image 109
bright-star Avatar answered Sep 29 '22 11:09

bright-star