I am very new to machine learning and am exploring random forests with the ranger library in R. My dependent variable is continuous, so this is a regression forest rather than a classification one. While trying out the functions, I noticed a discrepancy between the predictions stored in the fitted ranger object and those returned by predict(). The following lines give different predictions in results and results_alternative:
rf_reg <- ranger(formula = y ~ ., data = training_df)
results <- rf_reg$predictions
results_alternative <- predict(rf_reg, data = training_df)$predictions
Could anybody please explain why there is a discrepancy and what is causing it? Which one is correct? I have tried it with classification on iris data and that seemed to give the same results. Many thanks!
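The predictions in rf_reg$predictions are out-of-bag predictions (as stated in the "Value" section of ?ranger): each observation is predicted using only the trees whose bootstrap sample did not contain it. predict.ranger, on the other hand, uses all trees for every observation, including the trees that were trained on that observation.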
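One practical consequence can be sketched as follows. Because out-of-bag predictions never use trees that saw the observation, their error is usually treated as an honest estimate of generalization error, while predicting on the training data tends to look overly optimistic. The data set (mtcars), the seed, and the variable names below are illustrative assumptions, not taken from the question.

```r
library(ranger)

# Illustrative data and seed (assumptions, not from the question)
rf_demo <- ranger(mpg ~ ., data = mtcars, seed = 1)

y <- mtcars$mpg
oob_pred   <- rf_demo$predictions                          # out-of-bag predictions
train_pred <- predict(rf_demo, data = mtcars)$predictions  # all trees, including in-bag ones

mse_oob   <- mean((y - oob_pred)^2)
mse_train <- mean((y - train_pred)^2)

# For regression, ranger reports the out-of-bag mean squared error as prediction.error
all.equal(mse_oob, rf_demo$prediction.error)

# Compare the two error estimates side by side
c(oob = mse_oob, train = mse_train)
```

Neither set of predictions is "wrong"; they answer different questions. For judging how well the model might do on unseen data, the out-of-bag values (or predictions on a held-out set) are the safer choice.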
To demonstrate this, first let's train an RF model (using mtcars as example data). We set keep.inbag = TRUE so that we know which samples were in-bag versus out-of-bag for each tree.
rf = ranger(formula = mpg ~ ., data = mtcars, keep.inbag = TRUE)
Now we generate the predictions using three methods. The first two are the same as in the question. We also add a third predict method where we specify predict.all = TRUE, which will give us separate predictions for all trees. That will allow us to take averages of the individual tree predictions according to whether the observation was in or out of bag.
results.rf = rf$predictions # based on out-of-bag samples
results.pred = predict(rf, data = mtcars)$predictions # based on all samples
results.all = predict(rf, data = mtcars, predict.all = TRUE)$predictions # has all trees and all samples
Now we can check:
whether results.pred is identical to the average across all trees from results.all; and
whether results.rf is identical to the average across only those trees for which the sample was out-of-bag.
I demonstrate here for the first sample (i.e. the first row in mtcars).
inbag.counts = sapply(rf$inbag.counts, \(x) x[1])
oob = (inbag.counts == 0) # logical vector: TRUE for trees in which the sample is out-of-bag
all.equal(mean(results.all[1, ]), results.pred[1])
# [1] TRUE
all.equal(mean(results.all[1, oob]), results.rf[1])
# [1] TRUE
We can do the same check for all rows too:
for (i in seq_len(nrow(mtcars))) {
  # in-bag counts of sample i in each tree
  inbag.counts = sapply(rf$inbag.counts, \(x) x[i])
  oob = (inbag.counts == 0)
  print(all.equal(mean(results.all[i, ]), results.pred[i]))
  print(all.equal(mean(results.all[i, oob]), results.rf[i]))
}
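The same check can also be done without a loop. The following is a minimal sketch that assumes the per-tree prediction matrix has one row per sample and one column per tree, as used above. The seed and the helper names (inbag, mean.all, mean.oob) are illustrative additions.

```r
library(ranger)

# Refit with a fixed seed so the sketch is reproducible (the seed is an assumption)
rf <- ranger(mpg ~ ., data = mtcars, keep.inbag = TRUE, seed = 1)

pred.all <- predict(rf, data = mtcars, predict.all = TRUE)$predictions  # samples x trees
pred.avg <- predict(rf, data = mtcars)$predictions

# inbag.counts is a list with one count vector per tree; bind into a samples x trees matrix
inbag <- do.call(cbind, rf$inbag.counts)
oob <- inbag == 0

# Average over all trees, and over only the trees where each sample was out-of-bag
mean.all <- rowMeans(pred.all)
mean.oob <- rowSums(pred.all * oob) / rowSums(oob)

all.equal(as.numeric(mean.all), as.numeric(pred.avg))
all.equal(as.numeric(mean.oob), as.numeric(rf$predictions))
```

Multiplying by the logical matrix oob zeroes out the in-bag trees, so each row sum divided by the number of out-of-bag trees gives the out-of-bag average for that sample.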