RandomForest in R linear regression tails mtry

Question

I am using the randomForest package in R (R version 2.13.1, randomForest version 4.6-2) for regression and noticed a significant bias in my results: the prediction error depends on the value of the response variable. High values are underpredicted and low values are overpredicted. At first I suspected this was a consequence of my data, but the following simple example suggests that it is inherent to the random forest algorithm:

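library(randomForest)

n = 50
x1 = seq(1, n)           # informative predictor: 1, 2, ..., n
x2 = rep(1, n)           # constant predictor, carries no information
predictors = data.frame(x1 = x1, x2 = x2)
response = x2 + x1       # exactly linear in x1
rf = randomForest(x = predictors, y = response)   # default mtry and nodesize
plot(x1, response)
lines(x1, predict(rf, predictors), col = "red")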

No doubt tree methods have their limitations when it comes to linearity, but even the simplest regression tree, e.g. tree() in R, does not exhibit this bias. I can't imagine that the community is unaware of this, but I haven't found any mention of it. How is it generally corrected for? Thanks for any comments.

EDIT: The example in this question is flawed; please see "RandomForest for regression in R - response distribution dependent bias" on Stack Exchange for an improved treatment: https://stats.stackexchange.com/questions/28732/randomforest-for-regression-in-r-response-distribution-dependent-bias

joran · Accepted Answer

What you've discovered isn't an inherent bias in random forests, but simply a failure to properly adjust the tuning parameters on the model.

Using your example data:

# Consider both variables at every split and grow trees down to single observations
rf = randomForest(x = predictors, y = response, mtry = 2, nodesize = 1)
plot(x1, response)
lines(x1, predict(rf, predictors), col = "red")

For your real data the improvement is unlikely to be so stark, of course, and I'd bet you'll get more mileage out of nodesize than mtry (mtry did most of the work here).
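One way to check which of the two parameters matters for a given data set is to fit a small grid and compare out-of-bag error. The following is only a sketch on the toy data from the question: the grid values and the seed are arbitrary choices, and `oob_mse` is a name made up for this illustration. It relies on the `mse` component of a regression fit, which holds the out-of-bag mean squared error after each tree.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only so the run is repeatable

# Toy data from the question: one informative and one constant predictor
n <- 50
predictors <- data.frame(x1 = seq_len(n), x2 = rep(1, n))
response <- predictors$x1 + predictors$x2

# Small illustrative grid; nodesize = 5 is the documented regression default
grid <- expand.grid(mtry = 1:2, nodesize = c(1, 5))

# Fit one forest per row and record the out-of-bag MSE after the last tree
grid$oob_mse <- mapply(function(m, ns) {
  fit <- randomForest(x = predictors, y = response, mtry = m, nodesize = ns)
  tail(fit$mse, 1)
}, grid$mtry, grid$nodesize)

grid
```

On real data the same loop works with a wider grid; the out-of-bag error is a more honest yardstick than the error on the training points themselves.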

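The reason regular trees didn't exhibit this "bias" is that, by default, they search over all variables for the best split. A random forest instead considers only mtry randomly chosen variables at each split, so with a small mtry some splits can only use the constant x2.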
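To make the role of mtry concrete: the randomForest documentation gives the regression default as max(floor(p/3), 1), where p is the number of predictors. With the two predictors in this example that comes to 1, so each split sees a single randomly chosen variable, and about half the time that variable is the constant x2. The sketch below, with an arbitrary seed and a helper `train_mse` invented for this illustration, compares the default against a forest that always sees both variables.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only so the run is repeatable

# Toy data from the question
n <- 50
predictors <- data.frame(x1 = seq_len(n), x2 = rep(1, n))
response <- predictors$x1 + predictors$x2

# Documented default for regression: max(floor(p / 3), 1)
default_mtry <- max(floor(ncol(predictors) / 3), 1)

# Default forest versus one that considers every variable at each split
fit_default <- randomForest(x = predictors, y = response)
fit_all <- randomForest(x = predictors, y = response,
                        mtry = ncol(predictors))

# Hypothetical helper: mean squared error on the training points
train_mse <- function(fit) mean((predict(fit, predictors) - response)^2)

c(default = train_mse(fit_default), all_vars = train_mse(fit_all))
```

Setting mtry to the number of predictors turns the forest into plain bagging of trees, which is why it behaves much more like a single regression tree on this example.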

Tags: r, statistics, regression, random-forest

rumbleB
