 

RandomForest for Regression in R

I'm experimenting with R and the randomForest package. I have some experience with SVMs and neural nets. My first test is to regress sin(x) + Gaussian noise. With neural nets and SVMs I obtain a "relatively" nice approximation of sin(x), so the noise is filtered out and the learning algorithm doesn't overfit (for decent parameters). When doing the same with randomForest I get a completely overfitted solution. I simply use (R 2.14.0, also tried 2.14.1, just in case):

library("randomForest")
x <- seq(-3.14, 3.14, by = 0.00628)   # 1001 evenly spaced points over roughly [-pi, pi]
noise <- rnorm(1001)
y <- sin(x) + noise/4                 # noisy sine: added noise has sd 0.25
mat <- matrix(c(x, y), ncol = 2, dimnames = list(NULL, c("X", "Y")))
plot(x, predict(randomForest(Y ~ ., data = mat), mat), col = "green")  # in-sample predictions
points(x, y)                          # noisy training data

I guess there is a magic option in randomForest to make it work correctly; I tried a few but did not find the right lever to pull...

asked Feb 13 '12 by user1206729


2 Answers

You can use maxnodes to limit the size of the trees, as in the examples in the manual.

r <- randomForest(Y ~ ., data = mat, maxnodes = 10)   # cap each tree at 10 terminal nodes
plot(x, predict(r, mat), col = "green")
points(x, y)
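As a rough check of how much the tree-size limit smooths the fit, you can compare the predictions against the noise-free sin(x), using the same residual-spread measure as the other answer. This is just a sketch; the exact value depends on the random seed.

sd(predict(r, mat) - sin(x))   # spread of the maxnodes-limited fit around the true signal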
answered Oct 20 '22 by Vincent Zoonekynd


You can do a lot better (RMSE ~ 0.04, $R^2$ > 0.99) by training the individual trees on small samples, or "bites" as Breiman called them.

Since there is a significant amount of noise in the training data, this problem is really about smoothing rather than generalization. In general machine-learning terms, this calls for more regularization. For an ensemble learner this means trading strength for diversity.

The diversity of a randomForest can be increased by reducing the number of candidate features per split (mtry in R) or the training set of each tree (sampsize in R). Since there is only one input dimension, mtry does not help, which leaves sampsize. This leads to a 3.5x improvement in RMSE over the default settings and a >6x improvement over the noisy training data itself. Since increased diversity means increased variance in the predictions of the individual learners, we also need to increase the number of trees to stabilize the ensemble prediction (see the sketch at the end of this answer).

small bags, more trees :: rmse = 0.04:

> sd(predict(randomForest(Y~.,data=mat, sampsize=60, nodesize=2,
                         replace=FALSE, ntree=5000),
            mat)
    - sin(x))
[1] 0.03912643

default settings :: rmse=0.14:

> sd(predict(randomForest(Y~.,data=mat),mat) - sin(x))
[1] 0.1413018

error due to noise in training set :: rmse = 0.25:

> sd(y - sin(x))
[1] 0.2548882

The error due to noise is of course evident from the way y is constructed; the added noise has standard deviation 1/4 ≈ 0.25:

noise<-rnorm(1001)
y<-sin(x)+noise/4

In the above, the evaluation is done against the training set, as in the original question. Since the issue is smoothing rather than generalization, this is not as egregious as it may seem, but it is reassuring to see that out-of-bag evaluation (calling predict without newdata returns the out-of-bag predictions) shows similar accuracy:

> sd(predict(randomForest(Y~.,data=mat, sampsize=60, nodesize=2,
                          replace=FALSE, ntree=5000))
     - sin(x))
[1] 0.04059679
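To make the strength-versus-diversity trade-off described above concrete, one can scan a few sampsize values and compare each fit against the noise-free sin(x). This is only an illustrative sketch built from the same objects (mat, x) and parameters used above; the exact numbers vary from run to run, and ntree is reduced to 1000 here just to keep the runtime modest.

# Illustrative sketch (not from the original answer): smaller per-tree samples
# increase diversity; many trees stabilize the ensemble prediction.
for (s in c(30, 60, 120, 250, 500, 1001)) {
  fit <- randomForest(Y ~ ., data = mat, sampsize = s, nodesize = 2,
                      replace = FALSE, ntree = 1000)
  cat("sampsize =", s, " rmse vs sin(x) =",
      round(sd(predict(fit, mat) - sin(x)), 4), "\n")
}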
answered Oct 20 '22 by Daniel Mahler