R Random Forests Variable Importance

Tags: r, statistics, data-mining, random-forest

I am trying to use the random forests package for classification in R.

The Variable Importance Measures listed are:

mean raw importance score of variable x for class 0
mean raw importance score of variable x for class 1
MeanDecreaseAccuracy
MeanDecreaseGini

Now I know what these "mean", in the sense that I know their definitions. What I want to know is how to use them.

What I really want to know is what these values mean in only the context of how accurate they are, what is a good value, what is a bad value, what are the maximums and minimums, etc.

If a variable has a high MeanDecreaseAccuracy or MeanDecreaseGini does that mean it is important or unimportant? Also any information on raw scores could be useful too. I want to know everything there is to know about these numbers that is relevant to the application of them.

An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful than a simpler explanation that didn't involve any discussion of how random forests work.

Like if I wanted someone to explain to me how to use a radio, I wouldn't expect the explanation to involve how a radio converts radio waves into sound.

748

asked Apr 10 '09 02:04

thirsty93

2 Answers

An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful than a simpler explanation that didn't involve any discussion of how random forests work.

Like if I wanted someone to explain to me how to use a radio, I wouldn't expect the explanation to involve how a radio converts radio waves into sound.

How would you explain what the numbers in WKRP 100.5 FM "mean" without going into the pesky technical details of wave frequencies? Frankly, the parameters and related performance issues with Random Forests are difficult to get your head around even if you understand some technical terms.

Here's my shot at some answers:

-mean raw importance score of variable x for class 0

-mean raw importance score of variable x for class 1

Simplifying from the Random Forest web page, the raw importance score measures how much more helpful than random a particular predictor variable is in successfully classifying data.

-MeanDecreaseAccuracy

I think this is only in the R module, and I believe it measures how much inclusion of this predictor in the model reduces classification error.

-MeanDecreaseGini

Gini is defined as "inequity" when used in describing a society's distribution of income, or as a measure of "node impurity" in tree-based classification. A low Gini (i.e. a higher decrease in Gini) means that a particular predictor variable plays a greater role in partitioning the data into the defined classes. It's a hard one to describe without talking about the fact that data in classification trees are split at individual nodes based on values of predictors. I'm not so clear on how this translates into better performance.
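To make the "decrease in Gini" idea a bit more concrete, here is a minimal base R sketch of what a single split does to node impurity. The data and the split point are made up for illustration, and the randomForest package may scale its accumulated values differently from this toy calculation.

```r
# Gini impurity of a set of class labels: 1 - sum(p_k^2).
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Toy node: 10 observations, two classes, one numeric predictor (made-up data).
y <- factor(c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1))
x <- c(1, 2, 3, 4, 6, 5, 7, 8, 9, 10)

# Hypothetical split: x <= 4.5 goes left, everything else goes right.
left  <- y[x <= 4.5]
right <- y[x > 4.5]

# Impurity before the split, and the size-weighted impurity after it.
parent_impurity <- gini(y)
child_impurity  <- (length(left) * gini(left) +
                    length(right) * gini(right)) / length(y)

# The decrease in Gini produced by this one split.
gini_decrease <- parent_impurity - child_impurity
print(gini_decrease)
```

A predictor that keeps producing large decreases like this across many splits and many trees ends up with a high MeanDecreaseGini.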

151

answered Sep 19 '22 15:09

bubaker

For your immediate concern: higher values mean the variables are more important. This should be true for all the measures you mention.
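As a rough sketch of how that ranking might be read off in practice, the snippet below fits a forest and sorts the predictors by one of the measures. It assumes the randomForest package is installed; the two-class subset of iris is only a stand-in for your own data, and the seed is arbitrary.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(42)  # arbitrary seed so the run is repeatable

# Stand-in data: keep two of the three iris species to get a two-class problem.
dat <- droplevels(iris[iris$Species != "setosa", ])

# importance = TRUE is needed to get the per-class and
# MeanDecreaseAccuracy columns in addition to MeanDecreaseGini.
fit <- randomForest(Species ~ ., data = dat, importance = TRUE)

# Matrix with one row per predictor and one column per importance measure.
imp <- importance(fit)

# Rank predictors from most to least important by MeanDecreaseAccuracy.
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
print(ranked)

# varImpPlot(fit) draws the same rankings as dot charts.
```

The numbers have no fixed maximum or minimum, so they are most useful for comparing variables within the same fitted model. A MeanDecreaseAccuracy near zero (or below it) suggests the variable is not helping.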

Random forests give you pretty complex models, so it can be tricky to interpret the importance measures. If you want to easily understand what your variables are doing, don't use RFs. Use linear models or a (non-ensemble) decision tree instead.

You said:

An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful than a simpler explanation that didn't involve any discussion of how random forests work.

It's going to be awfully tough to explain much more than the above unless you dig in and learn about how random forests work. I assume you're complaining about either the manual, or the section from Breiman's manual:

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp

To figure out how important a variable is, they fill it with random junk ("permute" it), then see how much predictive accuracy decreases. MeanDecreaseAccuracy works this way. MeanDecreaseGini is computed differently: it adds up how much the splits on that variable reduce node impurity, averaged over all the trees. I'm not sure what the raw importance scores are.
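The shuffle-and-rescore idea can be sketched by hand, roughly as follows. This is a simplified illustration, not the package's own procedure: it assumes the randomForest package is installed, uses a held-out set where the package works with out-of-bag samples per tree, and the data, seed, and helper function `accuracy` are all made up for the example.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(1)  # arbitrary seed so the run is repeatable

# Two-class stand-in data, split into a training part and a held-out part.
dat <- droplevels(iris[iris$Species != "setosa", ])
train_rows <- sample(nrow(dat), 70)
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]

fit <- randomForest(Species ~ ., data = train)

# Hypothetical helper: share of held-out rows that are classified correctly.
accuracy <- function(model, data) mean(predict(model, data) == data$Species)
baseline <- accuracy(fit, test)

# For each predictor: shuffle that one column, re-score, record the drop.
predictors <- setdiff(names(dat), "Species")
drop_in_accuracy <- sapply(predictors, function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])  # fill the column with "random junk"
  baseline - accuracy(fit, shuffled)
})

# Bigger drop = the model leaned on that variable more.
print(sort(drop_in_accuracy, decreasing = TRUE))
```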

answered Sep 21 '22 15:09

Brendan OConnor
