
Why does ranger predict give different numbers when re-applied to training data?

I am very new to machine learning and am exploring random forests with the ranger library in R. My dependent variable is continuous, so this is a regression forest rather than a classification one. While trying out the functions, I noticed a discrepancy between the predictions stored in the fitted ranger object and those returned by predict() on the same training data. The following lines give different predictions in results and results_alternative:

rf_reg <- ranger(formula = y ~ ., data = training_df)

results <- rf_reg$predictions
results_alternative <- predict(rf_reg, data = training_df)$predictions

Could anybody please explain why there is a discrepancy and what is causing it? Which one is correct? I have tried it with classification on iris data and that seemed to give the same results. Many thanks!

Jhonny asked Oct 20 '25 23:10


1 Answer

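The predictions in rf_reg$predictions are out-of-bag predictions: each training observation is predicted using only the trees for which it was out-of-bag, i.e. the trees whose bootstrap sample did not include it (as stated in the "Value" section of ?ranger). predict.ranger, on the other hand, averages over all trees, including those that were trained on the observation.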
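As for which one is "correct": both are, for what they compute. The sketch below is only an illustration, assuming mtcars as stand-in data and an arbitrary seed (neither comes from the question). It compares the mean squared error of the two sets of predictions against the observed values. Because most trees have already seen any given training row, the predict() output on training data typically fits much more closely, so the out-of-bag predictions are generally the more honest guide to performance on new data.

```r
library(ranger)

# Assumption: mtcars and seed = 1 are illustrative choices only.
rf_demo <- ranger(mpg ~ ., data = mtcars, seed = 1)

oob_pred   <- rf_demo$predictions                          # out-of-bag trees only
train_pred <- predict(rf_demo, data = mtcars)$predictions  # all trees

# Mean squared error of each set of predictions against the observed values.
mse_oob   <- mean((oob_pred   - mtcars$mpg)^2)
mse_train <- mean((train_pred - mtcars$mpg)^2)

c(oob = mse_oob, train = mse_train)
```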

To demonstrate this, let's first train an RF model (using mtcars as example data). We set keep.inbag = TRUE so that we know which samples were in-bag versus out-of-bag for each tree.

library(ranger)
rf = ranger(formula = mpg ~ ., data = mtcars, keep.inbag = TRUE)

Now we generate the predictions using three methods. The first two are the same as in the question. We also add a third predict method where we specify predict.all = TRUE, which will give us separate predictions for all trees. That will allow us to take averages of the individual tree predictions according to whether the observation was in or out of bag.

results.rf   = rf$predictions   # out-of-bag: each sample uses only trees that did not train on it
results.pred = predict(rf, data = mtcars)$predictions # each sample is averaged over all trees
results.all  = predict(rf, data = mtcars, predict.all = TRUE)$predictions # matrix: one row per sample, one column per tree
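To make the indexing in the checks easier to follow, here is a small self-contained sketch of how these objects are laid out. The names rf_demo and all_trees and the seed are my own illustrative choices; the model is refitted so the snippet runs on its own.

```r
library(ranger)

# Assumption: seed = 1 is arbitrary; rf_demo and all_trees are illustrative names.
rf_demo   <- ranger(mpg ~ ., data = mtcars, keep.inbag = TRUE, seed = 1)
all_trees <- predict(rf_demo, data = mtcars, predict.all = TRUE)$predictions

dim(all_trees)                     # rows = samples, columns = trees
length(rf_demo$inbag.counts)       # one list element per tree
length(rf_demo$inbag.counts[[1]])  # one in-bag count per sample

# Fraction of trees for which the first sample was out-of-bag (count of zero).
mean(sapply(rf_demo$inbag.counts, function(x) x[1]) == 0)
```

So rf$inbag.counts[[t]][i] is the number of times sample i was drawn for tree t, and column t of the predict.all matrix holds that tree's prediction for every sample.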

Now we can check

  1. whether results.pred is identical to the average across all trees from results.all; and

  2. whether results.rf is identical to the average across only those trees for which the sample was out-of-bag.

I demonstrate here for the first sample (i.e. the first row in mtcars).

inbag.counts = sapply(rf$inbag.counts, \(x) x[1]) # times sample 1 was drawn, one entry per tree
oob = (inbag.counts == 0) # logical vector: TRUE for trees in which the sample is out-of-bag

all.equal(mean(results.all[1,    ]),  results.pred[1])
# [1] TRUE
all.equal(mean(results.all[1, oob]),  results.rf  [1])
# [1] TRUE

We can do the same check for all rows too:

for (i in seq_len(nrow(mtcars))) {
  inbag.counts = sapply(rf$inbag.counts, \(x) x[i])
  oob = (inbag.counts == 0)
  print(all.equal(mean(results.all[i,    ]),  results.pred[i]))
  print(all.equal(mean(results.all[i, oob]),  results.rf  [i]))
}
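The same check can also be done without a loop. This is a sketch under the same assumptions as above (mtcars, an arbitrary seed, and illustrative names such as inbag_mat and oob_mask that are not part of ranger). It binds the per-tree in-bag counts into a samples-by-trees matrix and uses it to mask the per-tree predictions. It also assumes every sample is out-of-bag for at least one tree, which is practically certain with the default 500 trees.

```r
library(ranger)

# Assumption: seed = 1 is arbitrary; all object names here are illustrative.
rf_demo   <- ranger(mpg ~ ., data = mtcars, keep.inbag = TRUE, seed = 1)
all_trees <- predict(rf_demo, data = mtcars, predict.all = TRUE)$predictions

# Samples x trees matrix of in-bag counts; zero means out-of-bag.
inbag_mat <- do.call(cbind, rf_demo$inbag.counts)
oob_mask  <- inbag_mat == 0

# Per-sample mean over out-of-bag trees only.
oob_mean <- rowSums(all_trees * oob_mask) / rowSums(oob_mask)

# Out-of-bag average vs. the predictions stored in the fitted object.
all.equal(oob_mean, rf_demo$predictions, check.attributes = FALSE)

# Average over all trees vs. predict() on the training data.
all.equal(rowMeans(all_trees),
          predict(rf_demo, data = mtcars)$predictions,
          check.attributes = FALSE)
```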
dww answered Oct 22 '25 12:10