
Bad predictions from ranger compared to randomForest

I'm trying out the ranger R package to speed up a large number of randomForest calculations. While examining the predictions it returns, I noticed that the predictions on held-out samples are almost all wrong, even though the model itself looks fine.

Below is a reproducible example comparing randomForest and ranger.

data(iris)
library(randomForest)


iris_spec <- as.factor(iris$Species)
iris_dat <- as.matrix(iris[, !(names(iris) %in% "Species")])

set.seed(1234)

test_index <- sample(nrow(iris), 10)
train_index <- seq(1, nrow(iris))[-test_index]


iris_train <- randomForest(x = iris_dat[train_index, ], y = iris_spec[train_index], keep.forest = TRUE)
iris_pred <- predict(iris_train, iris_dat[test_index, ])

iris_train$confusion


##            setosa versicolor virginica class.error
## setosa         47          0         0  0.00000000
## versicolor      0         42         3  0.06666667
## virginica       0          4        44  0.08333333


cbind(as.character(iris_pred), as.character(iris_spec[test_index]))
##       [,1]         [,2]        
##  [1,] "setosa"     "setosa"    
##  [2,] "versicolor" "versicolor"
##  [3,] "versicolor" "versicolor"
##  [4,] "versicolor" "versicolor"
##  [5,] "virginica"  "virginica" 
##  [6,] "virginica"  "virginica" 
##  [7,] "setosa"     "setosa"    
##  [8,] "setosa"     "setosa"    
##  [9,] "versicolor" "versicolor"
## [10,] "versicolor" "versicolor"


library(ranger)


iris_train2 <- ranger(data = iris[train_index, ], dependent.variable.name = "Species", write.forest = TRUE)
iris_pred2 <- predict(iris_train2, iris[test_index, ])

iris_train2$classification.table


##             true
## predicted    setosa versicolor virginica
##   setosa         47          0         0
##   versicolor      0         41         3
##   virginica       0          4        45


cbind(as.character(iris_pred2$predictions), as.character(iris_spec[test_index]))

##       [,1]         [,2]        
##  [1,] "versicolor" "setosa"    
##  [2,] "virginica"  "versicolor"
##  [3,] "virginica"  "versicolor"
##  [4,] "virginica"  "versicolor"
##  [5,] "virginica"  "virginica" 
##  [6,] "virginica"  "virginica" 
##  [7,] "versicolor" "setosa"    
##  [8,] "versicolor" "setosa"    
##  [9,] "virginica"  "versicolor"
## [10,] "virginica"  "versicolor"


sessionInfo()

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 22 (Twenty Two)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ranger_0.2.7        randomForest_4.6-12
## 
## loaded via a namespace (and not attached):
## [1] magrittr_1.5  formatR_1.2.1 tools_3.2.2   Rcpp_0.12.1   stringi_0.5-5
## [6] knitr_1.11    stringr_1.0.0 evaluate_0.8

As you can see, the overall confusion tables from the two packages look comparable, but the test-set predictions from ranger are wrong for 8 of the 10 samples, while randomForest gets all 10 right. Has anyone else encountered this before?
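To put a number on the mismatch, the two `cbind()` outputs above can be compared against the true labels directly. The following is a minimal base-R sketch, not part of the original session: the label vectors are copied by hand from the printed output, and `test_accuracy` is a hypothetical helper name introduced only for illustration.

```r
# Labels copied from the cbind() outputs in the question (test rows 1..10).
truth <- c("setosa", "versicolor", "versicolor", "versicolor", "virginica",
           "virginica", "setosa", "setosa", "versicolor", "versicolor")
rf_pred <- truth  # randomForest matched the truth on all 10 rows
ranger_pred <- c("versicolor", "virginica", "virginica", "virginica", "virginica",
                 "virginica", "versicolor", "versicolor", "virginica", "virginica")

# Hypothetical helper: share of test rows where the prediction equals the truth.
test_accuracy <- function(pred, truth) mean(pred == truth)

rf_acc <- test_accuracy(rf_pred, truth)
ranger_acc <- test_accuracy(ranger_pred, truth)

# Cross-tabulate to see how each true class was labelled by ranger.
lv <- c("setosa", "versicolor", "virginica")
shift_table <- table(true = factor(truth, levels = lv),
                     predicted = factor(ranger_pred, levels = lv))
print(shift_table)
```

In the cross-tabulation every true setosa comes back as versicolor and every true versicolor as virginica, so the errors are systematic rather than random. That pattern looks like a problem in how predicted classes are mapped back to labels rather than a badly trained forest, although the printed output alone cannot confirm the cause.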

rmflight asked Oct 26 '15 15:10


1 Answer

This was a bug. It is fixed in the GitHub version (see https://github.com/mnwright/ranger/issues/6) but the changes are not on CRAN yet. I will submit a new version to CRAN soon. In the meantime, please install the GitHub version:

devtools::install_github("mnwright/ranger/ranger-r-package/ranger")

Update: The fix has been on CRAN since Nov. 10.
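After upgrading, the fix can be checked by refitting and comparing the test-set predictions against the true labels. This is a sketch under assumptions: it presumes a ranger version containing the fix is installed, and it uses the formula interface instead of `dependent.variable.name`. Because `sample()` changed in later R releases, the selected test rows may differ from those in the question, so no expected output is shown.

```r
# Assumes an installed ranger version that already contains the fix.
library(ranger)

set.seed(1234)
test_index <- sample(nrow(iris), 10)
train_index <- setdiff(seq_len(nrow(iris)), test_index)

# Fit on the training rows and keep the forest so predict() can use it.
fit <- ranger(Species ~ ., data = iris[train_index, ], write.forest = TRUE)
pred <- predict(fit, data = iris[test_index, ])$predictions

# Compare held-out predictions with the true species.
truth <- iris$Species[test_index]
print(table(predicted = pred, true = truth))
acc <- mean(pred == truth)
print(acc)
```

With a fixed version, most or all of the held-out predictions should agree with the truth, in line with the randomForest results in the question.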

mnwright answered Nov 12 '22 19:11