Why does the importance parameter influence performance of Random Forest in R?

When using random forests in R I came across the following situation:

library(randomForest)
set.seed(42)
data(iris)
rf_noImportance <- randomForest(Species~.,data=iris,ntree=100,importance=F)
print(table(predict(rf_noImportance),iris$Species))

Output:

            setosa versicolor virginica
 setosa         50          0         0
 versicolor      0         47         3
 virginica       0          3        47

and

library(randomForest)
set.seed(42)
data(iris)  
rf_importance <- randomForest(Species~.,data=iris,ntree=100,importance=T)
print(table(predict(rf_importance),iris$Species))

Output:

            setosa versicolor virginica
 setosa         50          0         0
 versicolor      0         47         4
 virginica       0          3        46

In the first example I set importance = FALSE and in the second example TRUE. From my understanding this should not affect the resulting prediction. There's also no indication for that behavior in the documentation.

According to the Cross Validated thread Do proximity or importance influence predictions by a random forest?, the importance flag should not influence the predictions, but it clearly does in the example above.

So why is the importance parameter of the randomForest method influencing the performance of the model?

Asked by Tinu on Aug 03 '20


People also ask

What is importance in random forest in R?

Random forests can be used to rank the importance of variables in a regression or classification problem. The MeanDecreaseAccuracy measure reports how much removing (permuting) each variable reduces the accuracy of the model.
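
As a hedged illustration (a sketch added here, not part of the original question), this is roughly how those measures are inspected in R:

library(randomForest)
data(iris)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)
importance(rf)   # per-predictor MeanDecreaseAccuracy and MeanDecreaseGini
varImpPlot(rf)   # dot chart of the same two measures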

What does variable importance in random forest mean?

The feature importance (variable importance) describes which features are relevant. It can help with a better understanding of the problem being solved and can sometimes lead to model improvements by guiding feature selection.

How would you improve the performance of random forest?

If you wish to speed up your random forest, lower the number of trees (estimators). If you want to increase the accuracy of your model, increase the number of trees. You can also specify the maximum number of features considered at each node split; the best value depends heavily on your dataset. See the sketch below.
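
For instance (a hypothetical sketch, not taken from the question), these two knobs correspond to the ntree and mtry arguments of randomForest():

library(randomForest)
data(iris)
set.seed(42)
rf_fast   <- randomForest(Species ~ ., data = iris, ntree = 100,  mtry = 2)  # fewer trees: faster to fit
rf_stable <- randomForest(Species ~ ., data = iris, ntree = 1000, mtry = 2)  # more trees: usually more stable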

How do you choose best parameters for random forest?

We can use the tuneRF() function to find the optimal value of this parameter. By default, the randomForest() function uses 500 trees and a subset of randomly selected predictors as potential candidates at each split; these parameters can be adjusted by using tuneRF().
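
A minimal sketch of such a search (illustrative values, not from the question):

library(randomForest)
data(iris)
set.seed(42)
# try mtry values around the default, growing ntreeTry trees per candidate
tuned <- tuneRF(x = iris[, -5], y = iris$Species,
                ntreeTry = 500, stepFactor = 1.5, improve = 0.01, trace = TRUE)
tuned   # matrix of the tried mtry values and their OOB error estimates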


1 Answer

This is a nice example for demonstrating the limits of reproducibility; in short:

Fixing the random seed does not make results comparable if the experiments invoke the random number generator in different ways.
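
A toy illustration of this point (a sketch added here, unrelated to random forests): one extra draw from the RNG shifts everything that comes after it, even under the same seed.

set.seed(42)
a <- rnorm(3)

set.seed(42)
runif(1)         # one extra RNG call that the first run did not make
b <- rnorm(3)

identical(a, b)  # FALSE: the two streams have diverged after the extra call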

Let's see why this is so here...

All reproducibility studies rest on a strong implicit assumption: all other things being equal. If a change between two experiments invalidates this assumption, we cannot expect reproducibility in the deterministic sense you are seeking here (we may of course still expect reproducibility in the statistical sense, but that is not the issue here).

The difference between the two cases you present here (calculating feature importance or not) is really subtle; to see why it actually violates the principle stated above, we have to dig a little, both in the documentation and in the source code.

The documentation of the RF importance function already provides a strong hint (emphasis mine):

Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data [...] Then the same is done after permuting each predictor variable.

You may already be getting suspicious: such data permutations are normally performed randomly, hence potentially invoking the random number generator (RNG) for one extra step when we use importance=TRUE, a step that is absent in the importance=FALSE case.

In other words, if the importance=TRUE case involves the RNG in a way that the importance=FALSE case does not, then from the first moment this happens the two runs stop being deterministically comparable, irrespective of the common random seed.

At this point, this may be a strong hint, but it is still only a speculation; after all, and in principle, permutations can be performed deterministically, i.e. without involving any random processes (and hence the RNG). Where is the smoking gun?

It turns out that the smoking gun does indeed exist, buried in the underlying C source code used by the randomForest R package; here is the relevant part of the C function permuteOOB:

/* Permute tp */
    last = nOOB;
    for (i = 0; i < nOOB; ++i) {
        k = (int) last * unif_rand();
        tmp = tp[last - 1];
        tp[last - 1] = tp[k];
        tp[k] = tmp;
        last--;
    }

where we can clearly see the function unif_rand() (i.e. the RNG) being invoked at line #4 of the snippet (here in the source code), in a function that is called only when we ask for importance=TRUE, and not otherwise.
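
For illustration only, here is a rough R translation of that loop (a sketch, not code from the package): an in-place Fisher-Yates style shuffle that consumes one uniform random draw per element.

permute_like_permuteOOB <- function(tp) {  # hypothetical helper name, for illustration
  last <- length(tp)
  for (i in seq_along(tp)) {
    k <- floor(last * runif(1)) + 1        # analogue of unif_rand(); +1 for R's 1-based indexing
    tmp <- tp[last]; tp[last] <- tp[k]; tp[k] <- tmp
    last <- last - 1
  }
  tp
}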

Arguably, given that RF is an algorithm where randomness (and hence RNG use) enters at many points, this should be enough evidence as to why the two cases you present are not expected to be identical: the different use of the RNG makes the actual results diverge. On the other hand, a difference of a single misclassification among 150 samples should be reassuring that the two cases are still statistically similar. It should also be apparent that this subtle implementation detail (the extra RNG involvement) does not contradict the expectation from the linked thread: in theory the importance flag does not change the model, but in practice the different RNG streams mean the two fitted forests need not be identical.
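
As a final sanity check (a sketch added here, not needed for the argument above), you can compare the state of R's RNG after the two fits; since importance=TRUE consumes extra random draws, the two states are expected to differ:

library(randomForest)
data(iris)

set.seed(42)
rf_noImp <- randomForest(Species ~ ., data = iris, ntree = 100, importance = FALSE)
state_noImp <- .Random.seed

set.seed(42)
rf_imp <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)
state_imp <- .Random.seed

identical(state_noImp, state_imp)  # expected FALSE: the extra permutations advance the RNG further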

Answered by desertnaut on Nov 07 '22