When using random forests in R, I came across the following situation:
library(randomForest)
set.seed(42)
data(iris)
rf_noImportance <- randomForest(Species ~ ., data = iris, ntree = 100, importance = FALSE)
print(table(predict(rf_noImportance),iris$Species))
Output:
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
and
library(randomForest)
set.seed(42)
data(iris)
rf_importance <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)
print(table(predict(rf_importance),iris$Species))
Output:
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         4
  virginica       0          3        46
In the first example I set importance = FALSE and in the second example TRUE. From my understanding this should not affect the resulting predictions, and there is no indication of such behavior in the documentation.
According to the Cross Validated thread "Do proximity or importance influence predictions by a random forest?", the importance flag should not influence the predictions, but it clearly does in the example above. So why does the importance parameter of randomForest influence the predictions of the model?
This is a nice example for demonstrating the limits of reproducibility; let's see why this is so. In short:
All reproducibility studies rest on a strong implicit assumption: all other things being equal. If a change between two experiments invalidates this assumption, we cannot expect reproducibility in the deterministic sense you are seeking here (we may of course still expect reproducibility in the statistical sense, but that is not the issue here).
The difference between the two cases you present (calculating feature importance or not) is really subtle; to see why it actually violates the principle stated above, we have to dig a little, both in the documentation and in the source code.
The documentation of the RF importance function already provides a strong hint (emphasis mine):
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data [...] Then the same is done after permuting each predictor variable.
You may have already started becoming suspicious; such data permutations are normally performed randomly, hence potentially invoking the random number generator (RNG) for one extra process when we use importance=TRUE, a process that is absent in the importance=FALSE case.
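To make the permutation idea concrete, here is a toy sketch of permutation importance in plain R. This is purely illustrative and is not the randomForest package's internal implementation: the model, variable names, and error metric are my own choices.

```r
# Toy sketch of permutation importance (illustrative only, not
# randomForest internals): permute one predictor and measure how
# much the model's error increases.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

baseline_mse <- mean((predict(fit) - y)^2)

x_perm   <- sample(x)  # the permutation itself draws from the RNG
perm_mse <- mean((predict(fit, newdata = data.frame(x = x_perm)) - y)^2)

perm_mse - baseline_mse  # positive: permuting x hurts accuracy
```

The key observation for what follows is the comment on `sample(x)`: the permutation itself consumes random numbers.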
In other words: if the RNG is involved in the importance=TRUE case in a way that is absent from the importance=FALSE case, then from the first moment this happens in the program the two runs stop being deterministically comparable, irrespective of the common random seed.
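You can see this mechanism in miniature with base R alone (no randomForest needed): a single extra RNG draw between setting the seed and the next random operation changes everything that follows.

```r
set.seed(42)
a <- sample(150)     # some random operation after seeding

set.seed(42)
invisible(runif(1))  # one extra RNG call, analogous to the permutation step
b <- sample(150)     # the *same* operation, now on a shifted RNG stream

identical(a, b)      # the extra draw makes a and b differ
```

Both `a` and `b` are valid permutations of 1:150, i.e. statistically equivalent, but they are not the same permutation despite the identical seed.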
At this point, this may be a strong hint, but it is still only speculation; after all, and in principle, permutations can be performed deterministically, i.e. without involving any random processes (and hence the RNG). Where is the smoking gun?
It turns out that the smoking gun does exist, buried in the underlying C source code used by the randomForest R package; here is the relevant part of the C function permuteOOB:
/* Permute tp */
last = nOOB;
for (i = 0; i < nOOB; ++i) {
    k = (int) last * unif_rand();
    tmp = tp[last - 1];
    tp[last - 1] = tp[k];
    tp[k] = tmp;
    last--;
}
where we can clearly see the function unif_rand() (i.e. the RNG) being invoked at line #4 of the snippet (here in the source code), in a function that is called only when we ask for importance=TRUE, and not in the opposite case.
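The loop above is a standard Fisher–Yates shuffle. Here is an R transcription of it (my own sketch, 1-based and using runif() in place of unif_rand(); the function name is hypothetical, not part of the randomForest API) that makes the per-element RNG consumption explicit:

```r
# Sketch of the permuteOOB shuffle transcribed to R (assumption: mirrors
# the C loop above; not a function of the randomForest package).
# Each iteration draws one uniform random number, so permuting n values
# advances the RNG state n times.
permute_oob_sketch <- function(tp) {
  last <- length(tp)
  for (i in seq_along(tp)) {
    k <- floor(last * runif(1)) + 1  # 1-based analogue of (int)(last * unif_rand())
    tmp <- tp[last]
    tp[last] <- tp[k]
    tp[k] <- tmp
    last <- last - 1
  }
  tp
}

set.seed(42)
permute_oob_sketch(1:6)  # a random permutation of 1..6
```

Every call to this function advances the global RNG state by length(tp) draws, which is exactly the kind of extra RNG consumption that desynchronizes the importance=TRUE run from the importance=FALSE one.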
Arguably, given that RF is an algorithm in which randomness (and hence RNG use) enters at many points, this should be enough evidence as to why the two cases you present are indeed not expected to be identical: the different use of the RNG makes the actual results diverge. On the other hand, a difference of a single misclassification among 150 samples should provide enough reassurance that the two cases are still statistically similar. It should also be apparent that this subtle implementation issue (the extra RNG involvement) does not contradict the expectation that, in theory, the two results should be equal.