Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

New data points for existing PCA model

I have followed this tutorial to create and visualise a PCA. The part Im particularly interested in is adding new data points to the existing model.

As the tutorial suggests, one would use predict (ir.pca, newdata=tail(log.ir, 2)) to predict new PCs. But how do I add these new observations to the existing plot ? It doesnt look like predict function returns the same object as the ir.pca used in ggplot function.

I have found similar questions here and here but these are calculating new PCA scores and adding them to the variance plot (if I understood it correctly).

Ultimately what Im after is to see whether new points fall in within the confidence ellipse defined/derived using the initial dataset.

The code I'm using from the tutorial:

 # log transform 
    log.ir <- log(iris[, 1:4])
    ir.species <- iris[, 5]

 
# apply PCA - scale. = TRUE is highly 
# advisable, but default is FALSE. 
ir.pca <- prcomp(log.ir,
                 center = TRUE,
                 scale. = TRUE) 

library(devtools)
install_github("ggbiplot", "vqv")
 
library(ggbiplot)
g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1, 
              groups = ir.species, ellipse = TRUE, 
              circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')
print(g)

And as the tutorial suggests I'd like to add new data which came in to the existing plot visualised with ggplot

Thanks

like image 983
nlv Avatar asked Mar 11 '26 20:03

nlv


1 Answers

When we inspect the ggplot object, we see it has an element named data:

str(g)
# List of 9
#  $ data       :'data.frame':  150 obs. of  3 variables:
#   ..$ xvar  : num [1:150] -2.41 -2.22 -2.58 -2.45 -2.54 ...
#   ..$ yvar  : num [1:150] -0.397 0.69 0.428 0.686 -0.508 ...
#   ..$ groups: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#  $ layers     :List of 5
#  <snip>

Hence we can just add the new data points to the data dataframe. Suppose these 10 observations from iris are our "new" observations, and we predict their PC values:

set.seed(123)
x <- sample(seq_len(nrow(iris)), 10)
predicted <- predict(ir.pca, newdata = log.ir[x, ])

We can add these predicted values to the data dataframe

g$data <- rbind(g$data, 
  data.frame(
    xvar = predicted[, "PC1"],
    yvar = predicted[, "PC2"],
    groups = "new"
  )
)

so that print(g) yields enter image description here

like image 194
Weihuang Wong Avatar answered Mar 13 '26 10:03

Weihuang Wong