Using randomForest package in R, how to get probabilities from classification model?

TL;DR :

Is there something I can flag in the original randomForest call to avoid having to re-run the predict function to get predicted categorical probabilities, instead of just the likely category?

Details:

I am using the randomForest package.

I have a model something like:

model <- randomForest(x=out.data[train.rows, feature.cols],
                      y=out.data[train.rows, response.col],
                      xtest=out.data[test.rows, feature.cols],
                      ytest=out.data[test.rows, response.col],
                      importance=TRUE)

where out.data is a data frame and feature.cols is a mixture of numeric and categorical features. response.col is a TRUE / FALSE binary variable that I forced into a factor so that the randomForest model properly treats it as categorical.

All runs well, and the variable model is returned to me properly. However, I cannot seem to find a flag or parameter to pass to the randomForest function so that model is returned to me with the probabilities of TRUE or FALSE. Instead, I get simply predicted values. That is, if I look at model$predicted, I'll see something like:

FALSE FALSE TRUE TRUE FALSE . . .

Instead, I want to see something like:

   FALSE  TRUE
1  0.84   0.16
2  0.66   0.34
3  0.11   0.89
4  0.17   0.83
5  0.92   0.08
.   .      .
.   .      .
.   .      .

I can get the above, but in order to do so, I need to do something like:

tmp <- predict(model, out.data[test.rows, feature.cols], "prob")

[test.rows captures the row numbers for those that were used during the model testing. The details are not shown here, but are simple since the test row IDs are output into model.]

Then everything works fine. The problem is that the model is big and takes a very long time to run, and even the prediction itself takes a while. The prediction should be entirely unnecessary: I am simply looking to calculate the ROC curve on the test data set, whose predictions should already have been calculated when the model was fit. So I was hoping to skip this step. Is there something I can flag in the original randomForest call to avoid having to re-run the predict function?


asked Sep 07 '14 22:09

Mike Williamson

1 Answer

model$predicted is NOT the same thing that predict() returns. If you want the probability of the TRUE or FALSE class, you must either run predict() or pass x, y, xtest and ytest, like

randomForest(x, y, xtest=x, ytest=y)

where x=out.data[, feature.cols] and y=out.data[, response.col].

model$predicted returns, for each record, the class with the larger value in model$votes. As @joran pointed out, votes is the proportion of OOB (out-of-bag) 'votes' from the random forest: a tree's vote counts for a record only when that record was out of bag for that tree. predict(), on the other hand, returns the probability for each class based on the votes of all the trees.
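To make the distinction concrete, here is a minimal sketch. It assumes the built-in iris data as a stand-in for out.data, with a hypothetical TRUE / FALSE factor response derived from it; the names oob.class and all.tree.prob are illustrative only.

```r
library(randomForest)

set.seed(42)

# Stand-in data (an assumption, not the asker's out.data):
# four numeric features and a binary factor response.
x <- iris[, 1:4]
y <- factor(iris$Species == "virginica")

model <- randomForest(x, y)

# OOB vote proportions: one row per record, one column per class.
oob.votes <- unclass(model$votes)
head(oob.votes)

# Class with the larger OOB vote share for each record.
# This should agree with model$predicted except where votes are tied.
oob.class <- colnames(oob.votes)[max.col(oob.votes, ties.method = "first")]
mean(oob.class == as.character(model$predicted))

# predict() on the same rows uses every tree, including those that
# saw the record during training, so it generally differs from model$votes.
all.tree.prob <- predict(model, x, type = "prob")
head(all.tree.prob)
```

Because the OOB proportions only use trees that did not train on a given record, they are usually the less optimistic of the two on training data.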

Calling randomForest(x,y,xtest=x,ytest=y) behaves a little differently from passing a formula or simply calling randomForest(x,y). With xtest supplied, the model WILL return the probability for each class on the test data. It may sound a little odd, but it is found under model$test$votes. The predicted class is under model$test$predicted, which simply selects the class with the larger value in model$test$votes. model$predicted and model$votes keep the same OOB definition as above.

Finally, note that if randomForest(x,y,xtest=x,ytest=y) is used, the keep.forest flag must be set to TRUE in order to call predict() on the model afterwards:

model <- randomForest(x, y, xtest=x, ytest=y, keep.forest=TRUE)
prob <- predict(model, x, type="prob")

prob WILL be equivalent to model$test$votes, since both are computed from the same input x.
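Applied to the setup in the question, where xtest is a held-out set rather than x itself, this might look like the sketch below. It again assumes iris as stand-in data, and the train.rows / test.rows split is hypothetical.

```r
library(randomForest)

set.seed(1)

# Stand-in data and split (assumptions, not the asker's out.data).
x <- iris[, 1:4]
y <- factor(iris$Species == "virginica")
train.rows <- sample(nrow(x), 100)
test.rows  <- setdiff(seq_len(nrow(x)), train.rows)

# keep.forest=TRUE is only needed here so predict() can be called
# afterwards for comparison; model$test$votes is filled in either way.
model <- randomForest(x[train.rows, ], y[train.rows],
                      xtest = x[test.rows, ], ytest = y[test.rows],
                      keep.forest = TRUE)

# Test-set class probabilities, available without calling predict().
test.prob <- unclass(model$test$votes)
head(test.prob)

# The same probabilities obtained the slow way.
prob <- unclass(predict(model, x[test.rows, ], type = "prob"))

# Largest absolute difference between the two matrices.
max(abs(test.prob - prob))
```

The "TRUE" column of test.prob is the score you would feed into an ROC calculation, so the separate predict() call can be skipped.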


answered Oct 22 '22 01:10

Oscar
