I'm experimenting with R and the randomForest package; I have some experience with SVM and neural nets. My first test is to try to regress sin(x) + Gaussian noise. With neural nets and SVM I obtain a "relatively" nice approximation of sin(x), so the noise is filtered out and the learning algorithm doesn't overfit (for decent parameters). When doing the same with randomForest I get a completely overfitted solution. I simply use (R 2.14.0, tried on 2.14.1 too, just in case):
library("randomForest")
x<-seq(-3.14,3.14,by=0.00628)
noise<-rnorm(1001)
y<-sin(x)+noise/4
mat<-matrix(c(x,y),ncol=2,dimnames=list(NULL,c("X","Y")))
plot(x,predict(randomForest(Y~.,data=mat),mat),col="green")
points(x,y)
I guess there is a magic option in randomForest to make it work correctly; I tried a few, but I did not find the right lever to pull...
Some background: random forest is a supervised ensemble method (bagging) that can be used for both classification and regression. A large number of decision trees is built on samples of the training data, every observation is run through every tree, and for regression the forest's prediction is the mean of the individual trees' predictions (for classification it is the majority vote). The forest's nonlinear nature can give it a leg up over linear algorithms, but it is important to know your data and to keep in mind that a random forest cannot extrapolate beyond the range of the training responses.
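As a quick illustration of the extrapolation point (this snippet is my own sketch, reusing the x and y defined in the question, not code from the original thread): a forest trained on x in [-3.14, 3.14] predicts essentially constant values outside that range.

library(randomForest)
set.seed(1)
train <- data.frame(X = x, Y = y)            # same data as in the question
rf <- randomForest(Y ~ ., data = train)
x_new <- seq(-6, 6, by = 0.1)                # extends well past the training range
plot(x_new, predict(rf, data.frame(X = x_new)), type = "l", col = "green")
points(x, y, pch = ".")                      # predictions flatten outside [-3.14, 3.14]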
You can use maxnodes to limit the size of the trees, as in the examples in the manual.
r <- randomForest(Y ~ ., data = mat, maxnodes = 10)
plot(x, predict(r, mat), col = "green")
points(x, y)
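To check how close the capped-tree fit is to the underlying function (a quick sanity check of my own, using the same error measure as the answer further down), compare the predictions against the noise-free target:

sd(predict(r, mat) - sin(x))   # spread of the residuals around sin(x)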
You can do a lot better (rmse ~ 0.04, $R^2$ > 0.99) by training the individual trees on small samples, or "bites" as Breiman called them.
Since there is a significant amount of noise in the training data, this problem is really about smoothing rather than generalization. In general machine-learning terms, this calls for increasing regularization; for an ensemble learner, it means trading strength for diversity.
The diversity of a random forest can be increased by reducing the number of candidate features per split (mtry in R) or by reducing the training set of each tree (sampsize in R). Since there is only one input dimension, mtry does not help, which leaves sampsize. This leads to a 3.5x improvement in RMSE over the default settings and a >6x improvement over the noisy training data itself. Since increased diversity means increased variance in the predictions of the individual learners, we also need to increase the number of trees to stabilize the ensemble prediction (a small tuning sketch follows below).
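If you would rather pick sampsize systematically than by trial and error, a minimal sketch (the grid of candidate values is arbitrary, my own choice) is to compare the out-of-bag MSE across a few per-tree sample sizes:

library(randomForest)
set.seed(42)
for (s in c(30, 60, 120, 250, 500)) {        # candidate per-tree sample sizes
  rf <- randomForest(Y ~ ., data = mat, sampsize = s, nodesize = 2,
                     replace = FALSE, ntree = 1000)
  cat("sampsize =", s, " OOB MSE =", tail(rf$mse, 1), "\n")
}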
small bags, more trees :: rmse = 0.04:
> sd(predict(randomForest(Y ~ ., data = mat, sampsize = 60, nodesize = 2,
                          replace = FALSE, ntree = 5000),
             mat)
     - sin(x))
[1] 0.03912643
default settings :: rmse = 0.14:
> sd(predict(randomForest(Y ~ ., data = mat), mat) - sin(x))
[1] 0.1413018
error due to noise in training set :: rmse = 0.25:
> sd(y - sin(x))
[1] 0.2548882
The error due to noise is of course evident from the construction of the training data (noise is standard normal, so noise/4 has standard deviation 1/4 = 0.25):
noise<-rnorm(1001)
y<-sin(x)+noise/4
In the above, the evaluation is done against the training set, as it is in the original question. Since the issue is smoothing rather than generalization, this is not as egregious as it may seem, but it is reassuring to see that the out-of-bag evaluation shows similar accuracy:
> sd(predict(randomForest(Y ~ ., data = mat, sampsize = 60, nodesize = 2,
                          replace = FALSE, ntree = 5000))
     - sin(x))
[1] 0.04059679
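To see the smoothing visually (this plot is my addition, not part of the original answer), overlay the small-bag forest's fit on the noisy data and the true curve:

rf_smooth <- randomForest(Y ~ ., data = mat, sampsize = 60, nodesize = 2,
                          replace = FALSE, ntree = 5000)
plot(x, y, pch = ".")                                      # noisy training data
lines(x, sin(x), col = "blue", lwd = 2)                    # true function
lines(x, predict(rf_smooth, mat), col = "green", lwd = 2)  # forest fit
legend("topright", legend = c("sin(x)", "forest"), col = c("blue", "green"), lwd = 2)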