When using random forests in R, I came across the following situation:
library(randomForest)
set.seed(42)
data(iris)
rf_noImportance <- randomForest(Species ~ ., data = iris, ntree = 100, importance = FALSE)
print(table(predict(rf_noImportance),iris$Species))
Output:
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
and
library(randomForest)
set.seed(42)
data(iris)
rf_importance <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)
print(table(predict(rf_importance),iris$Species))
Output:
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         4
  virginica       0          3        46
In the first example I set importance = FALSE and in the second example TRUE. From my understanding this should not affect the resulting predictions, and there is no indication of such behavior in the documentation.
According to the Cross Validated thread "Do proximity or importance influence predictions by a random forest?", the importance flag should not influence the predictions, but it clearly does in the example above. So why does the importance parameter of randomForest influence the predictions of the model?
This is a nice example for demonstrating the limits of reproducibility; let's see why this is so. In short:
All reproducibility studies rest on a strong implicit assumption: all other things being equal. If a change between two experiments invalidates this assumption, we cannot expect reproducibility in the deterministic sense you are seeking here (we may of course still expect reproducibility in the statistical sense, but that is not the issue here).
The difference between the two cases you present (calculating feature importance or not) is really subtle; to see why it actually violates the principle stated above, we have to dig a little, both in the documentation and in the source code.
The documentation of the RF importance function already provides a strong hint (emphasis mine):
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data [...] Then the same is done after permuting each predictor variable.
You may have already started becoming suspicious; such data permutations are normally performed randomly, hence potentially invoking the random number generator (RNG) for one extra process when we use importance=TRUE, a process that is absent in the importance=FALSE case.
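To make the permutation idea concrete, here is a toy sketch of permutation importance in plain R. This is purely illustrative and is not the randomForest package's internal implementation: the model, variable names, and error metric are my own choices.

```r
# Toy sketch of permutation importance (illustrative only, not
# randomForest internals): permute one predictor and measure how
# much the model's error increases.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

baseline_mse <- mean((predict(fit) - y)^2)

x_perm   <- sample(x)  # the permutation itself draws from the RNG
perm_mse <- mean((predict(fit, newdata = data.frame(x = x_perm)) - y)^2)

perm_mse - baseline_mse  # positive: permuting x hurts accuracy
```

The key observation for what follows is the comment on `sample(x)`: the permutation itself consumes random numbers.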
In other words: if the RNG is involved in the importance=TRUE case in a way that is absent from the importance=FALSE case, then from the first moment this happens in the program the two runs stop being deterministically comparable, irrespective of the common random seed.
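You can see this mechanism in miniature with base R alone (no randomForest needed): a single extra RNG draw between setting the seed and the next random operation changes everything that follows.

```r
set.seed(42)
a <- sample(150)     # some random operation after seeding

set.seed(42)
invisible(runif(1))  # one extra RNG call, analogous to the permutation step
b <- sample(150)     # the *same* operation, now on a shifted RNG stream

identical(a, b)      # the extra draw makes a and b differ
```

Both `a` and `b` are valid permutations of 1:150, i.e. statistically equivalent, but they are not the same permutation despite the identical seed.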
At this point, this may be a strong hint, but it is still only speculation; after all, and in principle, permutations can be performed deterministically, i.e. without involving any random processes (and hence the RNG). Where is the smoking gun?
It turns out that the smoking gun does exist, buried in the underlying C source code used by the randomForest R package; here is the relevant part of the C function permuteOOB:
/* Permute tp */
last = nOOB;
for (i = 0; i < nOOB; ++i) {
    k = (int) last * unif_rand();
    tmp = tp[last - 1];
    tp[last - 1] = tp[k];
    tp[k] = tmp;
    last--;
}
where we can clearly see the function unif_rand() (i.e. the RNG) being invoked at line #4 of the snippet (here in the source code), in a function that is called only when we ask for importance=TRUE, and not in the opposite case.
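The loop above is a standard Fisher–Yates shuffle. Here is an R transcription of it (my own sketch, 1-based and using runif() in place of unif_rand(); the function name is hypothetical, not part of the randomForest API) that makes the per-element RNG consumption explicit:

```r
# Sketch of the permuteOOB shuffle transcribed to R (assumption: mirrors
# the C loop above; not a function of the randomForest package).
# Each iteration draws one uniform random number, so permuting n values
# advances the RNG state n times.
permute_oob_sketch <- function(tp) {
  last <- length(tp)
  for (i in seq_along(tp)) {
    k <- floor(last * runif(1)) + 1  # 1-based analogue of (int)(last * unif_rand())
    tmp <- tp[last]
    tp[last] <- tp[k]
    tp[k] <- tmp
    last <- last - 1
  }
  tp
}

set.seed(42)
permute_oob_sketch(1:6)  # a random permutation of 1..6
```

Every call to this function advances the global RNG state by length(tp) draws, which is exactly the kind of extra RNG consumption that desynchronizes the importance=TRUE run from the importance=FALSE one.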
Arguably, given that RF is an algorithm in which randomness (and hence RNG use) enters at many points, this should be enough evidence as to why the two cases you present are indeed not expected to be identical: the different use of the RNG makes the actual results diverge. On the other hand, a difference of a single misclassification among 150 samples should provide enough reassurance that the two cases are still statistically similar. It should also be apparent that this subtle implementation issue (the extra RNG involvement) does not contradict the expectation that, in theory, the two results should be equal.