I'm experimenting with R and the randomForest package; I have some experience with SVM and neural nets. My first test is to try to regress sin(x) + Gaussian noise. With neural nets and SVM I obtain a "relatively" nice approximation of sin(x), so the noise is filtered out and the learning algorithm doesn't overfit (for decent parameters). When doing the same with randomForest I get a completely overfitted solution. I simply use (R 2.14.0, tried on 2.14.1 too, just in case):
library("randomForest")
x<-seq(-3.14,3.14,by=0.00628)
noise<-rnorm(1001)
y<-sin(x)+noise/4
mat<-matrix(c(x,y),ncol=2,dimnames=list(NULL,c("X","Y")))
plot(x,predict(randomForest(Y~.,data=mat),mat),col="green")
points(x,y)
I guess there is a magic option in randomForest to make it work correctly; I tried a few, but I did not find the right lever to pull...
Some background: random forest is a supervised ensemble method (bagging) that can be used for both classification and regression. A large number of decision trees is built on samples of the training data, every observation is run through every tree, and for regression the forest's prediction is the mean of the individual trees' predictions (for classification it is the majority vote). The forest's nonlinear nature can give it a leg up over linear algorithms, but it is important to know your data and to keep in mind that a random forest cannot extrapolate beyond the range of the training responses.
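As a quick illustration of the extrapolation point (this snippet is my own sketch, reusing the x and y defined in the question, not code from the original thread): a forest trained on x in [-3.14, 3.14] predicts essentially constant values outside that range.

library(randomForest)
set.seed(1)
train <- data.frame(X = x, Y = y)            # same data as in the question
rf <- randomForest(Y ~ ., data = train)
x_new <- seq(-6, 6, by = 0.1)                # extends well past the training range
plot(x_new, predict(rf, data.frame(X = x_new)), type = "l", col = "green")
points(x, y, pch = ".")                      # predictions flatten outside [-3.14, 3.14]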
You can use maxnodes to limit the size of the trees, as in the examples in the manual.
r <- randomForest(Y ~ ., data = mat, maxnodes = 10)
plot(x, predict(r, mat), col = "green")
points(x, y)
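To check how close the capped-tree fit is to the underlying function (a quick sanity check of my own, using the same error measure as the answer further down), compare the predictions against the noise-free target:

sd(predict(r, mat) - sin(x))   # spread of the residuals around sin(x)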
You can do a lot better (rmse ~ 0.04, $R^2$ > 0.99) by training the individual trees on small samples, or "bites" as Breiman called them.
Since there is a significant amount of noise in the training data, this problem is really about smoothing rather than generalization. In general machine-learning terms, this calls for increasing regularization; for an ensemble learner, it means trading strength for diversity.
The diversity of a random forest can be increased by reducing the number of candidate features per split (mtry in R) or by reducing the training set of each tree (sampsize in R). Since there is only one input dimension, mtry does not help, which leaves sampsize. This leads to a 3.5x improvement in RMSE over the default settings and a >6x improvement over the noisy training data itself. Since increased diversity means increased variance in the predictions of the individual learners, we also need to increase the number of trees to stabilize the ensemble prediction (a small tuning sketch follows below).
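If you would rather pick sampsize systematically than by trial and error, a minimal sketch (the grid of candidate values is arbitrary, my own choice) is to compare the out-of-bag MSE across a few per-tree sample sizes:

library(randomForest)
set.seed(42)
for (s in c(30, 60, 120, 250, 500)) {        # candidate per-tree sample sizes
  rf <- randomForest(Y ~ ., data = mat, sampsize = s, nodesize = 2,
                     replace = FALSE, ntree = 1000)
  cat("sampsize =", s, " OOB MSE =", tail(rf$mse, 1), "\n")
}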
small bags, more trees :: rmse = 0.04:
> sd(predict(randomForest(Y ~ ., data = mat, sampsize = 60, nodesize = 2,
                          replace = FALSE, ntree = 5000),
             mat)
     - sin(x))
[1] 0.03912643
default settings :: rmse = 0.14:
> sd(predict(randomForest(Y ~ ., data = mat), mat) - sin(x))
[1] 0.1413018
error due to noise in training set :: rmse = 0.25:
> sd(y - sin(x))
[1] 0.2548882
The error due to noise is of course evident from the construction of the training data (noise is standard normal, so noise/4 has standard deviation 1/4 = 0.25):
noise<-rnorm(1001)
y<-sin(x)+noise/4
In the above, the evaluation is done against the training set, as it is in the original question. Since the issue is smoothing rather than generalization, this is not as egregious as it may seem, but it is reassuring to see that the out-of-bag evaluation shows similar accuracy:
> sd(predict(randomForest(Y ~ ., data = mat, sampsize = 60, nodesize = 2,
                          replace = FALSE, ntree = 5000))
     - sin(x))
[1] 0.04059679
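To see the smoothing visually (this plot is my addition, not part of the original answer), overlay the small-bag forest's fit on the noisy data and the true curve:

rf_smooth <- randomForest(Y ~ ., data = mat, sampsize = 60, nodesize = 2,
                          replace = FALSE, ntree = 5000)
plot(x, y, pch = ".")                                      # noisy training data
lines(x, sin(x), col = "blue", lwd = 2)                    # true function
lines(x, predict(rf_smooth, mat), col = "green", lwd = 2)  # forest fit
legend("topright", legend = c("sin(x)", "forest"), col = c("blue", "green"), lwd = 2)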