I'm trying to run a PCA on my data using R, and I found this nice guide, using prcomp and ggbiplot. My data is two sample types with three biological replicates each (i.e. 6 rows) and around 20000 genes (i.e. variables). First, getting the PCA model with the code described in the guide doesn't work:
>pca=prcomp(data,center=T,scale.=T)
Error in prcomp.default(data, center = T, scale. = T) :
cannot rescale a constant/zero column to unit variance
However, if I remove the scale. = T part, it works just fine and I get a model. Why is this, and is this the cause of the error below?
> summary(pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5
Standard deviation 4662.8657 3570.7164 2717.8351 1419.3137 819.15844
Proportion of Variance 0.4879 0.2861 0.1658 0.0452 0.01506
Cumulative Proportion 0.4879 0.7740 0.9397 0.9849 1.00000
Secondly, plotting the PCA. Even just using the basic code, I get an error and an empty plot image:
> ggbiplot(pca)
Error: invalid 'rot' value
What does this mean, and how can I fix it? Does it have something to do with the (non)scale in making the PCA, or is it something different? It must be something with my data, I think, since if I use a standard example code (below) I get a really nice PCA plot.
> data(wine)
> wine.pca=prcomp(wine,scale.=T)
> print(ggbiplot(wine.pca, obs.scale = 1, var.scale = 1, groups = wine.class,
ellipse = TRUE, circle = TRUE))
[EDIT 1] I have tried subsetting my data in two ways: 1) remove all columns where all rows are 0, and 2) remove all columns where any rows are 0. The first subset still gives me the scale error, but the second does not. Why is this? How does this affect my PCA?
Also, I tried using the normal biplot command for both the original data (non-scaled) and the subsetted data above, and it works in both cases. So is it something to do with ggbiplot?
[EDIT 2] I have uploaded a subset of my data that gives me the error when I don't remove all the zeroes and works when I do. I haven't used gist before, but I think this is it. Or this...
After transposing your data, I was able to replicate your error. The first error is the primary problem. PCA finds components that maximize variance, so variables are usually scaled to unit variance to keep a single high-variance variable from dominating the result. The first error:
Error in prcomp.default(tdf, center = T, scale. = T) :
cannot rescale a constant/zero column to unit variance
This is telling you that some of your variables have zero variance (i.e. no variability). Since PCA builds its components by maximizing variance, there is no point in retaining these variables. They can easily be removed with the following call:
df_f <- data[,apply(data, 2, var, na.rm=TRUE) != 0]
Once you apply this filter, the remaining calls work as expected:
pca <- prcomp(df_f, center = TRUE, scale. = TRUE)
ggbiplot(pca)
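To make the filter concrete, here is a minimal sketch on made-up data, using base R only. The data frame `toy` and its column names are hypothetical and are not taken from the uploaded data. It also shows one possible reason why removing only the all-zero columns may not be enough: a column that is constant but non-zero has zero variance too. Whether that is what happens in your data is an assumption you would need to check.

```r
# Hypothetical toy data: 6 samples (rows) x 5 genes (columns).
set.seed(1)
toy <- data.frame(
  gene_a = rnorm(6),
  gene_b = rnorm(6),
  gene_c = rep(0, 6),  # all zeros: zero variance
  gene_d = rep(7, 6),  # constant but non-zero: also zero variance
  gene_e = rnorm(6)
)

# Scaling fails while any constant column is present;
# capture the error message instead of stopping.
res <- tryCatch(
  prcomp(toy, center = TRUE, scale. = TRUE),
  error = function(e) conditionMessage(e)
)
print(res)

# Dropping only the all-zero columns leaves gene_d in place,
# so prcomp() with scale. = TRUE still fails on this subset.
no_zero_cols <- toy[, colSums(toy != 0) > 0, drop = FALSE]
res2 <- tryCatch(
  prcomp(no_zero_cols, center = TRUE, scale. = TRUE),
  error = function(e) conditionMessage(e)
)
print(res2)

# Filtering on variance removes both kinds of constant column.
col_var <- apply(toy, 2, var, na.rm = TRUE)
toy_f <- toy[, col_var != 0, drop = FALSE]
pca_toy <- prcomp(toy_f, center = TRUE, scale. = TRUE)
summary(pca_toy)
```

Running `names(col_var)[col_var == 0]` on your own data should list the offending genes, which may help you decide whether dropping them is acceptable for your analysis.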