
extracting more than 20 variables by importance via varImp

Tags:

r

r-caret

I'm dealing with a large dataset with more than 100 features (all of them relevant, since the original set of over 500 features has already been filtered). I created a random forest model with the train() function from the caret package, using the "ranger" method.

Here's the question: how does one extract all of the variables by importance, as opposed to only the top 20 most important variables? The varImp() function yields only the top 20 variables by default.

Here's some sample code (minus the training set, which is very large):

library(caret)

# "impurity" requests ranger's Gini-impurity variable importance
rforest_model <- train(target_variable ~ .,
                       data = train_data_set,
                       method = "ranger",
                       importance = "impurity")

And here's the code for extracting variable importance:

varImp(rforest_model)
Asked Jan 02 '18 by Flavio Abdenur

People also ask

What does varImp do in R?

The varImp function tracks the changes in model statistics, such as the GCV, for each predictor and accumulates the reduction in the statistic when each predictor's feature is added to the model. This total reduction is used as the variable importance measure.
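For instance, that GCV-based bookkeeping is what you see when the underlying model is a MARS fit. A minimal sketch, assuming the earth and mlbench packages are available (BostonHousing is just a stand-in data set, not part of the question):

library(caret)
library(earth)     # MARS implementation behind method = "earth"
library(mlbench)   # for the BostonHousing data

data(BostonHousing)

# varImp() reports the reduction in GCV attributed to each predictor
# as the MARS terms are added to the model
mars_fit <- train(medv ~ ., data = BostonHousing, method = "earth")
varImp(mars_fit)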

How do you calculate variable importance in random forest?

The default method to compute variable importance is the mean decrease in impurity (or Gini importance) mechanism: at each split in each tree, the improvement in the split criterion is the importance measure attributed to the splitting variable, and is accumulated over all the trees in the forest separately for each variable.
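This mean-decrease-in-impurity measure is what importance = "impurity" requests in the question's train() call, and ranger exposes it directly as well. A minimal sketch, assuming ranger and mlbench are installed (Ionosphere is the same data set used in the answer below):

library(ranger)
library(mlbench)   # for the Ionosphere data

data(Ionosphere)

# each split's impurity (Gini) improvement is credited to the splitting
# variable and summed over every tree in the forest
fit <- ranger(Class ~ ., data = Ionosphere, importance = "impurity")

sort(fit$variable.importance, decreasing = TRUE)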

How is variable importance calculated?

Variable importance is determined by calculating the relative influence of each variable: whether that variable was selected to split on during the tree building process, and how much the squared error (over all trees) improved (decreased) as a result.
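That "relative influence" wording comes from gradient boosted trees. A minimal sketch with the gbm package, assuming gbm and mlbench are installed (BostonHousing is just an example regression problem):

library(gbm)
library(mlbench)   # for the BostonHousing data

data(BostonHousing)

set.seed(1)
gbm_fit <- gbm(medv ~ ., data = BostonHousing,
               distribution = "gaussian", n.trees = 100)

# relative influence: squared-error improvement credited to each splitting
# variable, summed over all trees and scaled to add up to 100
summary(gbm_fit)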

What is variable importance in machine learning?

(My) definition: Variable importance refers to how much a given model "uses" that variable to make accurate predictions. The more a model relies on a variable to make predictions, the more important it is for the model. It can apply to many different models, each using different metrics.


1 Answer

The varImp function extracts importance for all variables (even if they are not used by the model); it just prints the top 20 variables. Consider this example:

library(mlbench)    # for the Ionosphere data set
library(caret)
library(tidyverse)  # for the plotting pipeline below

set.seed(998)
data(Ionosphere)

# Class is the response; columns 1:34 are the 34 predictors
rforest_model <- train(y = Ionosphere$Class,
                       x = Ionosphere[, 1:34],
                       method = "ranger",
                       importance = "impurity")

nrow(varImp(rforest_model)$importance) #34 variables extracted

Let's check them:

varImp(rforest_model)$importance %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  arrange(Overall) %>%
  mutate(rowname = forcats::fct_inorder(rowname)) %>%  # keep the sorted order on the axis
  ggplot() +
    geom_col(aes(x = rowname, y = Overall)) +
    coord_flip() +
    theme_bw()

(Plot: horizontal bar chart of the importance of all 34 predictors.)

Note that V2 is a zero-variance feature in this data set, hence it has 0 importance and is not used by the model at all.
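If you prefer caret's built-in output to ggplot2, the same full table is available there as well; a small sketch (the plot method's top argument simply raises the number of variables shown above the default):

imp <- varImp(rforest_model)

# the full importance table, one row per predictor
imp$importance

# caret's own dot plot, showing every variable instead of the top 20
plot(imp, top = nrow(imp$importance))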

Answered Sep 19 '22 by missuse