 

PCA for dimensionality reduction before Random Forest

I am working on a binary-classification random forest with approximately 4,500 variables. Many of these variables are highly correlated, and some of them are just quantiles of an original variable. I am not sure whether it would be wise to apply PCA for dimensionality reduction. Would this improve the model performance?

I would like to be able to know which variables are more significant to my model, but if I use PCA, I would only be able to tell which PCs are more important.

Many thanks in advance.

asked Aug 14 '15 by Rita A. Singer

People also ask

When would you use PCA for dimensionality reduction?

Perhaps the most popular technique for dimensionality reduction in machine learning is Principal Component Analysis, or PCA for short. This is a technique that comes from the field of linear algebra and can be used as a data preparation technique to create a projection of a dataset prior to fitting a model.
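A minimal base-R sketch of that projection step (the data here is simulated for illustration): the key point is that new data must be projected with the same rotation and centering learned from the training data.

```r
# Project a feature matrix onto its first k principal components
# as a data-preparation step before fitting any model.
set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)   # 100 observations, 10 features
pca <- prcomp(x, center = TRUE, scale. = TRUE)
k <- 3
x_reduced <- pca$x[, 1:k]                 # scores: the projected data, 100 x 3

# New data is projected with the SAME rotation/centering via predict():
x_new <- matrix(rnorm(5 * 10), ncol = 10)
x_new_reduced <- predict(pca, x_new)[, 1:k]
```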

Is dimensionality reduction needed for Random Forest?

Random forest is useful for dimensionality reduction when you have a well-defined supervised learning problem.
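One common way to use RF itself for dimensionality reduction is to fit once, rank variables by importance, and refit on the top ones. A hedged sketch on simulated data (assumes the randomForest package, as used in the answers below):

```r
library(randomForest)

set.seed(1)
x <- data.frame(matrix(rnorm(300 * 20), ncol = 20))
y <- factor(rnorm(300) + x[[1]] + x[[2]] > 0)      # only X1 and X2 carry signal

# Fit, rank by mean decrease in accuracy, then refit on the top 5 variables.
rf_full <- randomForest(x, y, ntree = 100, importance = TRUE)
imp <- importance(rf_full, type = 1)
top <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:5]

rf_small <- randomForest(x[, top], y, ntree = 100)  # reduced model
```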

Is it preferable to do PCA before CART?

In theory, you could get the same performance out of a model whether you used PCA beforehand or not. In practice, having better-structured data can make or break a model. In any case, give PCA a shot. It might be worth your while.

When should you use PCA?

PCA should be used mainly for variables which are strongly correlated. If the relationships between variables are weak, PCA does not work well to reduce the data. Refer to the correlation matrix to decide: in general, if most of the correlation coefficients are smaller than 0.3, PCA will not help.
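That rule of thumb can be checked directly. A base-R sketch on simulated data (the 0.3 cutoff is the heuristic quoted above, not a hard rule):

```r
# Share of off-diagonal correlations with |r| < 0.3:
# if most pairs are that weak, PCA will compress the data very little.
set.seed(1)
x <- matrix(rnorm(200 * 8), ncol = 8)   # independent columns -> weak correlations
cm <- cor(x)
off_diag <- cm[upper.tri(cm)]
weak_share <- mean(abs(off_diag) < 0.3)
weak_share                               # close to 1 here, so PCA would not help much
```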


2 Answers

My experience is that PCA before RF offers little advantage, if any. Principal component regression (PCR) is a case where PCA does help: it regularizes the training features before OLS linear regression, which is much needed for sparse data sets. Since RF itself already performs a good/fair regularization without assuming linearity, PCA is not necessarily an advantage. That said, I found myself writing a PCA-RF wrapper for R two weeks ago. The code includes a simulated data set of 100 features comprising only 5 true linear components; under such circumstances it is in fact a small advantage to pre-filter with PCA. The wrapper is a seamless implementation: all RF parameters are simply passed on to randomForest, and the loading vectors are saved in the model fit for use during prediction.

Regarding: "I would like to be able to know which variables are more significant to my model, but if I use PCA, I would only be able to tell which PCs are more important."

The easy way is to run the model without PCA, obtain the variable importances, and expect to find something similar for PCA-RF.

The tedious way is to wrap the PCA-RF in a new bagging scheme with your own variable-importance code. It could be done in 50-100 lines or so.
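A lighter alternative to a full bagging scheme is permutation importance pushed through the pipeline: permute each original column, re-project it with the same PCA rotation, and measure the increase in error. The helper below is a hypothetical sketch (the `pca_obj`/`rf_obj` pair mirrors the fields of the wrapper further down, and a regression RF is assumed):

```r
library(randomForest)

# Permutation importance of the ORIGINAL variables through a PCA + RF pipeline.
pca_rf_importance <- function(pca_obj, rf_obj, x, y, ncomp) {
  base_pred <- predict(rf_obj, predict(pca_obj, x)[, 1:ncomp])
  base_mse  <- mean((y - base_pred)^2)
  sapply(seq_len(ncol(x)), function(j) {
    xp <- x
    xp[, j] <- sample(xp[, j])                        # break column j only
    pred <- predict(rf_obj, predict(pca_obj, xp)[, 1:ncomp])
    mean((y - pred)^2) - base_mse                     # MSE increase = importance
  })
}
```

Variables whose permutation barely moves the error contribute little through any PC, which recovers a per-variable ranking despite the PCA step.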

The source-code suggestion for PCA-RF:

#wrap PCA around randomForest, forwarding any other arguments to randomForest
#define as a new S3 model class
train_PCA_RF = function(x,y,ncomp=5,...) {
  f.args = as.list(match.call()[-1])
  f.args$ncomp = NULL   #ncomp is consumed here; do not forward it to randomForest
  pca_obj = princomp(x)
  rf_obj = do.call(randomForest,c(alist(x=pca_obj$scores[,1:ncomp]),f.args[-1]))
  out = mget(ls())      #keep everything (incl. pca_obj) for prediction
  class(out) = "PCA_RF"
  return(out)
}

#print method
print.PCA_RF = function(x,...) print(x$rf_obj)

#predict method
predict.PCA_RF = function(object,Xtest=NULL,...) {
  print("predicting PCA_RF")
  f.args=as.list(match.call()[-1])
  if(is.null(f.args$Xtest)) stop("cannot predict without newdata parameter")
  sXtest = predict(object$pca_obj,Xtest) #scale Xtest as Xtrain was scaled before
  return(do.call(predict,c(alist(object = object$rf_obj, #class(x)="randomForest" invokes method predict.randomForest
                                 newdata = sXtest),      #newdata input, see help(predict.randomForest)
                                 f.args[-1:-2])))  #any other parameters are passed to predict.randomForest

}

#simulate train/test data#
make.component.data = function(
  inter.component.variance = .9,
  n.real.components = 5,
  nVar.per.component = 20,
  nObs=600,
  noise.factor=.2,
  hidden.function = function(x) apply(x,1,mean),
  plot_PCA =T
){
  Sigma=matrix(inter.component.variance,
               ncol=nVar.per.component,
               nrow=nVar.per.component)
  diag(Sigma)  = 1
  x = do.call(cbind,replicate(n = n.real.components,
                              expr = {mvrnorm(n=nObs,
                                              mu=rep(0,nVar.per.component),
                                              Sigma=Sigma)},
                              simplify = FALSE)
            )
  if(plot_PCA) plot(prcomp(x,center=TRUE,scale.=TRUE))
  y = hidden.function(x)
  ynoised = y + rnorm(nObs,sd=sd(y)) * noise.factor
  out = list(x=x,y=ynoised)
  pars = ls()[!ls() %in% c("x","y","Sigma")]
  attr(out,"pars") = mget(pars) #attach all pars as attributes
  return(out)
}

A runnable example:

#start script------------------------------
#source above from separate script
#test
library(MASS)
library(randomForest)

Data = make.component.data(nObs=600)#plots PC variance
train = list(x=Data$x[  1:300,],y=Data$y[1:300])
test = list(x=Data$x[301:600,],y=Data$y[301:600])

rf = randomForest(train$x, train$y, ntree=50) #regular RF
rf2 = train_PCA_RF(train$x, train$y, ntree=50, ncomp=12)

rf
rf2


pred_rf = predict(rf  ,test$x)
pred_rf2 = predict(rf2,test$x)

cat("rf, R^2:",cor(test$y,pred_rf  )^2,"PCA_RF, R^2", cor(test$y,pred_rf2)^2)

cor(test$y,predict(rf ,test$x))^2
cor(test$y,predict(rf2,test$x))^2

pairs(list(trueY = test$y,
           native_rf = pred_rf,
           PCA_RF = pred_rf2)
)
answered Sep 28 '22 by Soren Havelund Welling


You can have a look here to get a better idea. The link says to use PCA for smaller datasets. Some of my colleagues have used random forests for the same purpose when working with genomes; they had ~30,000 variables and a large amount of RAM.

Another thing I found is that random forests use up a lot of memory, and you have 4,500 variables. So maybe you could apply PCA to the individual trees.

answered Sep 28 '22 by Animesh Pandey