 

Does R randomForest's rfcv method actually say which features it selected, or not?

I would like to use rfcv to cull the unimportant variables from a data set before creating a final random forest with more trees (please correct and inform me if that's not the way to use this function). For example,

    > data(fgl, package = "MASS")
    > tst <- rfcv(trainx = fgl[,-10], trainy = fgl[,10], scale = "log", step = 0.7)
    > tst$error.cv
            9         6         4         3         2         1 
    0.2289720 0.2149533 0.2523364 0.2570093 0.3411215 0.5093458

In this case, if I understand the result correctly, it seems that we can remove three variables without negative side effects. However,

    > attributes(tst)
    $names
    [1] "n.var"     "error.cv"  "predicted"

None of these slots tells me which three variables can be harmlessly removed from the data set.
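(For context: `rfcv` ranks variables by importance within each cross-validation fold and drops the lowest-ranked ones, so the per-step variable sets can differ between folds and are not stored in the return value. One way to approximate them is to rank importance on the full data; the sketch below assumes the default non-recursive ranking, and the seed is arbitrary.)

```r
# Sketch (not part of rfcv's API): approximate which variables survive
# each elimination step by ranking importance on the full data.
library(randomForest)
data(fgl, package = "MASS")
set.seed(647)                      # arbitrary seed for reproducibility
rf <- randomForest(fgl[, -10], fgl[, 10], importance = TRUE)
imp <- importance(rf, type = 1)    # type = 1: mean decrease in accuracy
ranked <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
ranked[1:6]                        # roughly the 6 variables kept at the 6-variable step
```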

asked Aug 10 '12 by tresbot


1 Answer

I think the purpose of rfcv is to establish how your accuracy is related to the number of variables you use. This might not seem useful when you have 10 variables, but when you have thousands of variables it is quite handy to understand how much those variables "add" to the predictive power.
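To see that relationship, you can plot the cross-validated error against the number of variables from the `rfcv` result; a minimal sketch (the seed and variable name `tst` are my own):

```r
library(randomForest)
data(fgl, package = "MASS")
set.seed(647)
tst <- rfcv(trainx = fgl[, -10], trainy = fgl[, 10], scale = "log", step = 0.7)
# cross-validated error as a function of the number of variables used
with(tst, plot(n.var, error.cv, type = "b", log = "x",
               xlab = "number of variables", ylab = "CV error"))
```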

As you probably found out, this code

library(randomForest)
data(fgl, package = "MASS")
rf <- randomForest(type ~ ., data = fgl, importance = TRUE)  # importance = TRUE adds permutation importance
importance(rf)

gives you the relative importance of each of the variables.
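Combining the two, one way to act on the `rfcv` result (this recipe is mine, not something the package does for you) is to keep the top-ranked variables at the best `n.var` and grow a larger final forest on just those:

```r
library(randomForest)
data(fgl, package = "MASS")
set.seed(647)
rf <- randomForest(type ~ ., data = fgl, importance = TRUE)
imp <- importance(rf, type = 1)                     # mean decrease in accuracy
keep <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:6]  # 6 = best n.var from rfcv
rf.final <- randomForest(x = fgl[, keep], y = fgl$type, ntree = 1000)
rf.final
```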

answered Oct 30 '22 by nograpes