 

PCA for dimensionality reduction before Random Forest

I am working on a binary-classification random forest with approximately 4,500 variables. Many of these variables are highly correlated, and some of them are just quantiles of an original variable. I am not sure whether it would be wise to apply PCA for dimensionality reduction. Would this improve the model performance?

I would like to be able to know which variables are more significant to my model, but if I use PCA, I would only be able to tell which PCs are more important.

Many thanks in advance.

asked Aug 14 '15 by Rita A. Singer

People also ask

When would you use PCA for dimensionality reduction?

Perhaps the most popular technique for dimensionality reduction in machine learning is Principal Component Analysis, or PCA for short. This is a technique that comes from the field of linear algebra and can be used as a data preparation technique to create a projection of a dataset prior to fitting a model.
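A minimal base-R sketch of that projection step (the data here is simulated for illustration): the key point is that new data must be projected with the same rotation and centering learned from the training data.

```r
# Project a feature matrix onto its first k principal components
# as a data-preparation step before fitting any model.
set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)   # 100 observations, 10 features
pca <- prcomp(x, center = TRUE, scale. = TRUE)
k <- 3
x_reduced <- pca$x[, 1:k]                 # scores: the projected data, 100 x 3

# New data is projected with the SAME rotation/centering via predict():
x_new <- matrix(rnorm(5 * 10), ncol = 10)
x_new_reduced <- predict(pca, x_new)[, 1:k]
```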

Is dimensionality reduction needed for Random Forest?

Random forest is useful for dimensionality reduction when you have a well-defined supervised learning problem.
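One common way to use RF itself for dimensionality reduction is to fit once, rank variables by importance, and refit on the top ones. A hedged sketch on simulated data (assumes the randomForest package, as used in the answers below):

```r
library(randomForest)

set.seed(1)
x <- data.frame(matrix(rnorm(300 * 20), ncol = 20))
y <- factor(rnorm(300) + x[[1]] + x[[2]] > 0)      # only X1 and X2 carry signal

# Fit, rank by mean decrease in accuracy, then refit on the top 5 variables.
rf_full <- randomForest(x, y, ntree = 100, importance = TRUE)
imp <- importance(rf_full, type = 1)
top <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:5]

rf_small <- randomForest(x[, top], y, ntree = 100)  # reduced model
```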

Is it preferable to do PCA before CART?

In theory, you could get the same performance out of a model whether you used PCA beforehand or not. In practice, having better-structured data can make or break a model. In any case, give PCA a shot. It might be worth your while.

When should you use PCA?

PCA should be used mainly for variables which are strongly correlated. If the relationships between variables are weak, PCA does not work well to reduce the data. Refer to the correlation matrix to decide: in general, if most of the correlation coefficients are smaller than 0.3, PCA will not help.
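That rule of thumb can be checked directly. A base-R sketch on simulated data (the 0.3 cutoff is the heuristic quoted above, not a hard rule):

```r
# Share of off-diagonal correlations with |r| < 0.3:
# if most pairs are that weak, PCA will compress the data very little.
set.seed(1)
x <- matrix(rnorm(200 * 8), ncol = 8)   # independent columns -> weak correlations
cm <- cor(x)
off_diag <- cm[upper.tri(cm)]
weak_share <- mean(abs(off_diag) < 0.3)
weak_share                               # close to 1 here, so PCA would not help much
```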


2 Answers

My experience is that PCA before RF offers little advantage, if any. Principal component regression (PCR) is a case where PCA does help: it regularizes the training features before OLS linear regression, which is much needed for sparse data sets. Since RF itself already performs a good/fair regularization without assuming linearity, PCA is not necessarily an advantage. That said, I found myself writing a PCA-RF wrapper for R two weeks ago. The code includes a simulated data set of 100 features comprising only 5 true linear components; under such circumstances it is in fact a small advantage to pre-filter with PCA. The wrapper is a seamless implementation: all RF parameters are simply passed on to randomForest, and the loading vectors are saved in the model fit for use during prediction.

Regarding: "I would like to be able to know which variables are more significant to my model, but if I use PCA, I would only be able to tell which PCs are more important."

The easy way is to run the model without PCA, obtain the variable importances, and expect to find something similar for PCA-RF.

The tedious way is to wrap the PCA-RF in a new bagging scheme with your own variable-importance code. It could be done in 50-100 lines or so.
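A lighter alternative to a full bagging scheme is permutation importance pushed through the pipeline: permute each original column, re-project it with the same PCA rotation, and measure the increase in error. The helper below is a hypothetical sketch (the `pca_obj`/`rf_obj` pair mirrors the fields of the wrapper further down, and a regression RF is assumed):

```r
library(randomForest)

# Permutation importance of the ORIGINAL variables through a PCA + RF pipeline.
pca_rf_importance <- function(pca_obj, rf_obj, x, y, ncomp) {
  base_pred <- predict(rf_obj, predict(pca_obj, x)[, 1:ncomp])
  base_mse  <- mean((y - base_pred)^2)
  sapply(seq_len(ncol(x)), function(j) {
    xp <- x
    xp[, j] <- sample(xp[, j])                        # break column j only
    pred <- predict(rf_obj, predict(pca_obj, xp)[, 1:ncomp])
    mean((y - pred)^2) - base_mse                     # MSE increase = importance
  })
}
```

Variables whose permutation barely moves the error contribute little through any PC, which recovers a per-variable ranking despite the PCA step.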

The source-code suggestion for PCA-RF:

#wrap PCA around randomForest, forwarding any other arguments to randomForest
#define as a new S3 model class
train_PCA_RF = function(x,y,ncomp=5,...) {
  f.args = as.list(match.call()[-1])
  f.args$ncomp = NULL   #ncomp is consumed here; do not forward it to randomForest
  pca_obj = princomp(x)
  rf_obj = do.call(randomForest,c(alist(x=pca_obj$scores[,1:ncomp]),f.args[-1]))
  out = mget(ls())      #keep everything (incl. pca_obj) for prediction
  class(out) = "PCA_RF"
  return(out)
}

#print method
print.PCA_RF = function(x,...) print(x$rf_obj)

#predict method
predict.PCA_RF = function(object,Xtest=NULL,...) {
  print("predicting PCA_RF")
  f.args=as.list(match.call()[-1])
  if(is.null(f.args$Xtest)) stop("cannot predict without newdata parameter")
  sXtest = predict(object$pca_obj,Xtest) #scale Xtest as Xtrain was scaled before
  return(do.call(predict,c(alist(object = object$rf_obj, #class(x)="randomForest" invokes method predict.randomForest
                                 newdata = sXtest),      #newdata input, see help(predict.randomForest)
                                 f.args[-1:-2])))  #any other parameters are passed to predict.randomForest

}

#simulate train/test data#
make.component.data = function(
  inter.component.variance = .9,
  n.real.components = 5,
  nVar.per.component = 20,
  nObs=600,
  noise.factor=.2,
  hidden.function = function(x) apply(x,1,mean),
  plot_PCA =T
){
  Sigma=matrix(inter.component.variance,
               ncol=nVar.per.component,
               nrow=nVar.per.component)
  diag(Sigma)  = 1
  x = do.call(cbind,replicate(n = n.real.components,
                              expr = {mvrnorm(n=nObs,
                                              mu=rep(0,nVar.per.component),
                                              Sigma=Sigma)},
                              simplify = FALSE)
            )
  if(plot_PCA) plot(prcomp(x,center=TRUE,scale.=TRUE))
  y = hidden.function(x)
  ynoised = y + rnorm(nObs,sd=sd(y)) * noise.factor
  out = list(x=x,y=ynoised)
  pars = ls()[!ls() %in% c("x","y","Sigma")]
  attr(out,"pars") = mget(pars) #attach all pars as attributes
  return(out)
}

A runnable example:

#start script------------------------------
#source above from separate script
#test
library(MASS)
library(randomForest)

Data = make.component.data(nObs=600)#plots PC variance
train = list(x=Data$x[  1:300,],y=Data$y[1:300])
test = list(x=Data$x[301:600,],y=Data$y[301:600])

rf = randomForest(train$x, train$y, ntree=50) #regular RF
rf2 = train_PCA_RF(train$x, train$y, ntree=50, ncomp=12)

rf
rf2


pred_rf = predict(rf  ,test$x)
pred_rf2 = predict(rf2,test$x)

cat("rf, R^2:",cor(test$y,pred_rf  )^2,"PCA_RF, R^2", cor(test$y,pred_rf2)^2)

cor(test$y,predict(rf ,test$x))^2
cor(test$y,predict(rf2,test$x))^2

pairs(list(trueY = test$y,
           native_rf = pred_rf,
           PCA_RF = pred_rf2)
)
answered Sep 28 '22 by Soren Havelund Welling


You can have a look here to get a better idea. The link says to use PCA for smaller datasets. Some of my colleagues have used random forests for the same purpose when working with genomes; they had ~30,000 variables and a large amount of RAM.

Another thing I found is that random forests use up a lot of memory, and you have 4,500 variables. So maybe you could apply PCA to the individual trees.

answered Sep 28 '22 by Animesh Pandey