I'm trying out the ranger R package to speed up a large number of randomForest calculations. While examining the predictions I get back from it, I noticed something odd: the predictions are completely off.
Below is a reproducible example comparing randomForest and ranger.
data(iris)
library(randomForest)
iris_spec <- as.factor(iris$Species)
iris_dat <- as.matrix(iris[, !(names(iris) %in% "Species")])
set.seed(1234)
test_index <- sample(nrow(iris), 10)
train_index <- seq(1, nrow(iris))[-test_index]
iris_train <- randomForest(x = iris_dat[train_index, ], y = iris_spec[train_index], keep.forest = TRUE)
iris_pred <- predict(iris_train, iris_dat[test_index, ])
iris_train$confusion
##            setosa versicolor virginica class.error
## setosa         47          0         0  0.00000000
## versicolor      0         42         3  0.06666667
## virginica       0          4        44  0.08333333
cbind(as.character(iris_pred), as.character(iris_spec[test_index]))
## [,1] [,2]
## [1,] "setosa" "setosa"
## [2,] "versicolor" "versicolor"
## [3,] "versicolor" "versicolor"
## [4,] "versicolor" "versicolor"
## [5,] "virginica" "virginica"
## [6,] "virginica" "virginica"
## [7,] "setosa" "setosa"
## [8,] "setosa" "setosa"
## [9,] "versicolor" "versicolor"
## [10,] "versicolor" "versicolor"
library(ranger)
iris_train2 <- ranger(data = iris[train_index, ], dependent.variable.name = "Species", write.forest = TRUE)
iris_pred2 <- predict(iris_train2, iris[test_index, ])
iris_train2$classification.table
##             true
## predicted    setosa versicolor virginica
##   setosa         47          0         0
##   versicolor      0         41         3
##   virginica       0          4        45
cbind(as.character(iris_pred2$predictions), as.character(iris_spec[test_index]))
## [,1] [,2]
## [1,] "versicolor" "setosa"
## [2,] "virginica" "versicolor"
## [3,] "virginica" "versicolor"
## [4,] "virginica" "versicolor"
## [5,] "virginica" "virginica"
## [6,] "virginica" "virginica"
## [7,] "versicolor" "setosa"
## [8,] "versicolor" "setosa"
## [9,] "virginica" "versicolor"
## [10,] "virginica" "versicolor"
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 22 (Twenty Two)
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ranger_0.2.7 randomForest_4.6-12
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2.1 tools_3.2.2 Rcpp_0.12.1 stringi_0.5-5
## [6] knitr_1.11 stringr_1.0.0 evaluate_0.8
As you can see, the overall confusion tables look comparable, but the test-set predictions from ranger are completely off: only 2 of the 10 match the true labels. Has anyone else encountered this before?
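To put a number on the mismatch rather than comparing the columns by eye, one could compute the held-out accuracy for both models. This is only a minimal sketch under the same split as above; the helper `accuracy` is a hypothetical name introduced here for illustration, and the formula interface is used for both packages instead of the calls shown in the question.

```r
library(randomForest)
library(ranger)

data(iris)
set.seed(1234)
test_index <- sample(nrow(iris), 10)
train <- iris[-test_index, ]
test  <- iris[test_index, ]

# Hypothetical helper (not part of either package):
# fraction of predictions that match the true labels.
accuracy <- function(predicted, truth) {
  mean(as.character(predicted) == as.character(truth))
}

rf_fit <- randomForest(Species ~ ., data = train)
rg_fit <- ranger(Species ~ ., data = train, write.forest = TRUE)

rf_acc <- accuracy(predict(rf_fit, test), test$Species)
rg_acc <- accuracy(predict(rg_fit, data = test)$predictions, test$Species)

# Named vector with one accuracy per package.
c(randomForest = rf_acc, ranger = rg_acc)
```

If the two accuracies differ sharply while the out-of-bag confusion tables agree, that points at the prediction step rather than the fitted forest.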
This was a bug. It is fixed in the GitHub version (see https://github.com/mnwright/ranger/issues/6) but the changes are not on CRAN yet. I will submit a new version to CRAN soon. In the meantime, please install the GitHub version:
devtools::install_github("mnwright/ranger/ranger-r-package/ranger")
Update: The fix has been on CRAN since Nov. 10.
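To check whether an installation is still the version from the question, something like the following could be used. It is a rough sketch: 0.2.7 is the version reported in the question's sessionInfo(), and treating every later release as fixed is an assumption, since the exact version number of the fixed release is not stated here. The helper `is_affected` is a hypothetical name used only for illustration.

```r
# Version shown in the question's sessionInfo(), where the bug was observed.
affected_version <- "0.2.7"

# Hypothetical helper: TRUE if the given version is not newer than the
# affected one. Assumption: any later release contains the fix.
is_affected <- function(version) {
  package_version(version) <= package_version(affected_version)
}

if (requireNamespace("ranger", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("ranger"))
  if (is_affected(installed)) {
    message("ranger ", installed, " may still have the bug; consider updating.")
  } else {
    message("ranger ", installed, " is newer than ", affected_version, ".")
  }
} else {
  message("ranger is not installed.")
}
```

A version check only narrows things down; re-running the reproducible example above remains the direct way to confirm that predictions line up with the true labels.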