
RandomForest in R linear regression tails mtry

I am using the randomForest package in R (R version 2.13.1, randomForest version 4.6-2) for regression and noticed a significant bias in my results: the prediction error depends on the value of the response variable. High values are underpredicted and low values are overpredicted. At first I suspected this was a consequence of my data, but the following simple example shows that this is inherent to the random forest algorithm:

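library(randomForest)

# Toy data: x1 is a linear ramp, x2 is a constant column
n = 50
x1 = seq(1, n)
x2 = matrix(1, n, 1)
predictors = data.frame(x1=x1, x2=x2)
response = x2 + x1
rf = randomForest(x=predictors, y=response)
# Plot the true response and overlay the forest's predictions in red
plot(x1, response)
lines(x1, predict(rf, predictors), col="red")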
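The pattern in the plot can also be checked numerically. The following is a minimal sketch on the same toy data (rebuilt with plain vectors; the seed is arbitrary and the names res and trend are only illustrative). It regresses the residuals on the response; a positive slope means low values are overpredicted and high values are underpredicted.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only for reproducibility

# Same toy data as in the question, built from plain vectors
n <- 50
x1 <- seq(1, n)
x2 <- rep(1, n)
predictors <- data.frame(x1 = x1, x2 = x2)
response <- x1 + x2

# Forest with the package defaults
rf <- randomForest(x = predictors, y = response)

# Residual = observed minus predicted (predictions on the training data)
res <- response - predict(rf, predictors)

# Slope of the residuals against the response
trend <- coef(lm(res ~ response))[["response"]]
trend
```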

No doubt tree methods have their limitations when it comes to linearity, but even the simplest regression tree, e.g. tree() in R, does not exhibit this bias. I can't imagine that the community would be unaware of this, but I haven't found any mention of it. How is it generally corrected for? Thanks for any comments.

EDIT: The example for this question is flawed; please see "RandomForest for regression in R - response distribution dependent bias" on Stack Exchange for an improved treatment: https://stats.stackexchange.com/questions/28732/randomforest-for-regression-in-r-response-distribution-dependent-bias

rumbleB asked May 09 '12 00:05



1 Answer

What you've discovered isn't an inherent bias in random forests, but simply a failure to properly adjust the tuning parameters on the model.

Using your example data:

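# mtry = 2: consider both predictors at every split
# nodesize = 1: allow terminal nodes down to a single observation
rf = randomForest(x=predictors, y=response, mtry=2, nodesize=1)
plot(x1, response)
lines(x1, predict(rf, predictors), col="red")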


For your real data the improvement is unlikely to be so stark, of course, and I'd bet you'll get more mileage out of nodesize than mtry there (in this example, mtry did most of the work).
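To see how much each parameter contributes on the toy data, one option is to fit the forest under all four combinations and compare training-set RMSE. This is only a sketch: fit_rmse is a helper defined here (not part of randomForest), the seed is arbitrary, and the exact numbers will vary from run to run.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only for reproducibility

# Toy data from the question, built from plain vectors
n <- 50
predictors <- data.frame(x1 = seq(1, n), x2 = rep(1, n))
response <- predictors$x1 + predictors$x2

# Hypothetical helper: fit a forest with the given settings and
# return the RMSE of its predictions on the training data
fit_rmse <- function(...) {
  rf <- randomForest(x = predictors, y = response, ...)
  sqrt(mean((response - predict(rf, predictors))^2))
}

rmse <- c(
  default       = fit_rmse(),
  nodesize_only = fit_rmse(nodesize = 1),
  mtry_only     = fit_rmse(mtry = 2),
  both          = fit_rmse(mtry = 2, nodesize = 1)
)
round(rmse, 2)
```

Training-set error is used here only because the question's plot also predicts on the training data; for real data you would want out-of-bag or held-out error instead.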

Regular trees didn't exhibit this "bias" because, by default, they search over all variables for the best split.
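By contrast, the regression default documented in ?randomForest is mtry = max(floor(ncol(x)/3), 1), which works out to 1 for two predictors. A quick base-R sketch of that arithmetic on the example data (default_mtry and constant_cols are just illustrative names):

```r
# Example predictors from the question, built from plain vectors
n <- 50
predictors <- data.frame(x1 = seq(1, n), x2 = rep(1, n))

# Default mtry for regression, following the formula in ?randomForest
p <- ncol(predictors)
default_mtry <- max(floor(p / 3), 1)
default_mtry

# Columns with a single distinct value can never produce a split
is_constant <- sapply(predictors, function(col) length(unique(col)) == 1)
constant_cols <- names(predictors)[is_constant]
constant_cols
```

So with the defaults, each split considers a single randomly chosen variable, and here that variable is the constant x2 about half the time. Setting mtry = 2 makes x1 available at every split, which is what a regular tree does anyway.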

joran answered Oct 08 '22 17:10
