R - does `prcomp` take sample data or covariance matrices as input?

Tags:

If you search online there are a few threads that discuss the use of the covmat flag in the function princomp, which performs principal component analysis on its input. If the covmat parameter is not defined, princomp first computes the sample covariance matrix of the input.

On the other hand, there is little to no discussion about what a similar function prcomp actually does to accomplish the task of principal component analysis on data, only discussions about whether or not it is more accurate than princomp. This begs the question: does prcomp take covariance matrices or sample data matrices as input? It's not clear from the help documentation, which states (in the non-formula context):

Default S3 method:

prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, ...)

x - a numeric or complex matrix (or data frame) which provides the data for the principal components analysis.

The help file does not include any examples for this method, only an example for the one documented above it which works on formula objects. The documentation sort of implies that the expected input is a covariance matrix, like so:

The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using eigen on the covariance matrix.

However, it is unclear whether "data matrix" is a "covariance matrix", and whether or not "data matrix" means x as given in the earlier part of the documentation.

956

asked Nov 10 '13 23:11

bright-star

1 Answers

The answer, luckily, can be found in the source codes of the two functions.

First, the source for prcomp:

> stats:::prcomp.default
function (x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, 
    ...) 
{
    x <- as.matrix(x)
    x <- scale(x, center = center, scale = scale.)
    cen <- attr(x, "scaled:center")
    sc <- attr(x, "scaled:scale")
    if (any(sc == 0)) 
        stop("cannot rescale a constant/zero column to unit variance")
    s <- svd(x, nu = 0)
    s$d <- s$d/sqrt(max(1, nrow(x) - 1))
    if (!is.null(tol)) {
        rank <- sum(s$d > (s$d[1L] * tol))
        if (rank < ncol(x)) {
            s$v <- s$v[, 1L:rank, drop = FALSE]
            s$d <- s$d[1L:rank]
        }
    }
    dimnames(s$v) <- list(colnames(x), paste0("PC", seq_len(ncol(s$v))))
    r <- list(sdev = s$d, rotation = s$v, center = if (is.null(cen)) FALSE else cen, 
        scale = if (is.null(sc)) FALSE else sc)
    if (retx) 
        r$x <- x %*% s$v
    class(r) <- "prcomp"
    r
}

Notice that no covariance calculations are being performed in the upper block. The scaling and centering operations are performed on the input as provided, at which point the singular value decomposition (SVD) function is called on the result. The next step is to check the size of the result against the rank of the resulting diagonalization to ensure the result is valid. Lastly, the output is formatted and set to the appropriate class.

In other words, prcomp is a nice improvement to simply calling SVD on covariance matrices, but will not compute covariance matrices for you. ~~prcomp is not called on data, it's called on some provided estimate of covariance of some data.~~

edit: The struck-out sentence is wrong! There is no need to form the covariance matrix in this case, which I would have realized had I had my math hat on properly! For an explanation why, see this math.SO thread. Calculating the principal components with SVD on the data matrix is definitely more efficient here.

Compare to the code from princomp (only a portion shown):

if (is.list(covmat)) {
    if (any(is.na(match(c("cov", "n.obs"), names(covmat))))) 
        stop("'covmat' is not a valid covariance list")
    cv <- covmat$cov
    n.obs <- covmat$n.obs
    cen <- covmat$center
}
else if (is.matrix(covmat)) {
    if (!missing(x)) 
        warning("both 'x' and 'covmat' were supplied: 'x' will be ignored")
    cv <- covmat
    n.obs <- NA
    cen <- NULL
}
else if (is.null(covmat)) {
    dn <- dim(z)
    if (dn[1L] < dn[2L]) 
        stop("'princomp' can only be used with more units than variables")
    covmat <- cov.wt(z)
    n.obs <- covmat$n.obs
    cv <- covmat$cov * (1 - 1/n.obs)
    cen <- covmat$center
}

As you can see, the princomp function does a lot more depending on how the input is passed, which requires more care.

109

answered Sep 29 '22 11:09

bright-star

Related questions
                            
                                ggplot2 binwidth not responding in facet_wrap histogram plot
                            
                                Changing alpha doesn't affect anything in ggplot2
                            
                                Set default graphical parameters for device
                            
                                Fast grouping by list column subsets in data.table
                            
                                lagging in data.table R
                            
                                multicore and data.table in R
                            
                                Package that downloads data from the internet during installation
                            
                                Arrange many plots using gridExtra
                            
                                Is there a way to not return the `by` column as the first column in a data.table grouping
                            
                                R documentation, how to force Rcmd Rd2pdf to automatically break an overfull line in examples R-like section?
                            
                                Caching child files in knitr
                            
                                Google Trends in R [closed]
                            
                                R to Stata: Exporting dataframe with variable labels
                            
                                Why "object 'x' not found" on a line that does not use x?
                            
                                Regression kriging of binomial data
                            
                                Parallel programming using GPU in R [closed]
                            
                                Running regression tree on large dataset in R
                            
                                Parse dates in format dmy together with dmY using parse_date_time
                            
                                Adjust row height xtable R
                            
                                R - how to prevent row.names when selecting rows from a data frame

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

R - does `prcomp` take sample data or covariance matrices as input?

Tags:

r

matrix

Default S3 method:

bright-star

People also ask

1 Answers

bright-star

Recent Activity

Donate For Us