 

tidymodels: ranger with cross validation

Tags:

r

tidymodels

The dataset can be found here: https://www.kaggle.com/mlg-ulb/creditcardfraud

I am trying to use tidymodels to run ranger with 5 fold cross validation on this dataset.

I have 2 code blocks. The first code block is the original code with the full data. The second code block is almost identical to the first, except that I have subset a portion of the data so the code runs faster. The second block is just to make sure my code works before I run it on the original dataset.

Here is the first code block with the full data:

#load packages
library(tidyverse)
library(tidymodels)
library(tune)
library(workflows)

#load data
df <- read.csv("~creditcard.csv")

#check for NAs and convert Class to factor
anyNA(df)
df$Class <- as.factor(df$Class)

#set seed and split data into training and testing
set.seed(123)
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)

#in the training and testing datasets, how many are fraudulent transactions?
df_train %>% count(Class)
df_test %>% count(Class)

#ranger model with 5-fold cross validation
rf_spec <- 
  rand_forest() %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

all_wf <- 
  workflow() %>% 
  add_formula(Class ~ .) %>% 
  add_model(rf_spec)

cv_folds <- vfold_cv(df_train, v = 5)
cv_folds

rf_results <-
  all_wf %>% 
  fit_resamples(resamples = cv_folds)

rf_results %>% 
  collect_metrics()

Here is the second code block with 1,000 rows:

#load packages
library(tidyverse)
library(tidymodels)
library(tune)
library(workflows)

#load data
df <- read.csv("~creditcard.csv")

###################################################################################
#Testing area#
df <- df %>% arrange(-Class) %>% head(1000)

###################################################################################

#check for NAs and convert Class to factor
anyNA(df)
df$Class <- as.factor(df$Class)

#set seed and split data into training and testing
set.seed(123)
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)

#in the training and testing datasets, how many are fraudulent transactions?
df_train %>% count(Class)
df_test %>% count(Class)

#ranger model with 5-fold cross validation
rf_spec <- 
  rand_forest() %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

all_wf <- 
  workflow() %>% 
  add_formula(Class ~ .) %>% 
  add_model(rf_spec)

cv_folds <- vfold_cv(df_train, v = 5)
cv_folds

rf_results <-
  all_wf %>% 
  fit_resamples(resamples = cv_folds)

rf_results %>% 
  collect_metrics()

1) With the first code block, I can assign and print cv_folds in the console. The Global Environment pane says cv_folds has 5 obs. of 2 variables. When I View(cv_folds), I see columns labeled splits and id, but there are no rows and no data. When I use str(cv_folds), R just hangs ("thinking"), and there is no red STOP icon I can press; the only thing I can do is force quit RStudio. Maybe I just need to wait longer? I am not sure. When I do the same thing with the smaller second code block, str() works fine.

2) My overall goal for this project is to split the dataset into training and testing sets, partition the training data with 5-fold cross-validation, and train a ranger model on it. Next, I want to examine the metrics of my model on the training data, then test the model on the testing set and view those metrics. Eventually, I want to swap out ranger for something like xgboost. Please give me advice on what parts of my code I can add or modify to improve; I am still missing the portion that tests my model on the testing set.

I think the Predictions portion of this article might be what I'm aiming for.
https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/

3) When I use rf_results %>% collect_metrics(), it only shows accuracy and roc_auc. How do I get sensitivity, specificity, precision, and recall?

4) How do I plot importance? Would I use something like this?

rf_fit <- get_tree_fit(all_wf)
vip::vip(rf_fit, geom = "point")

5) How can I drastically reduce the amount of time it takes the model to train? The last time I ran ranger with 5-fold cross-validation using caret on this dataset, it took 8+ hours (6 cores, 4.0 GHz, 16 GB RAM, SSD, GTX 1060). I am open to anything (e.g. restructuring the code, AWS computing, parallelization, etc.).


Edit: this is another way I have tried to set this up:

#ranger model with 5-fold cross validation
rf_recipe <- recipe(Class ~ ., data = df_train)

rf_engine <- 
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

rf_grid <- grid_random(
  mtry() %>% range_set(c(1, 20)),
  trees() %>% range_set(c(500, 1000)), 
  min_n() %>% range_set(c(2, 10)),
  size = 30)

all_wf <- 
  workflow() %>% 
  add_recipe(rf_recipe) %>% 
  add_model(rf_engine)

cv_folds <- vfold_cv(df_train, v = 5)
cv_folds

#####
rf_fit <- tune_grid(
  all_wf,
  resamples = cv_folds,
  grid = rf_grid,
  metrics = metric_set(roc_auc),
  control = control_grid(save_pred = TRUE)
)

collect_metrics(rf_fit)

rf_fit_best <- select_best(rf_fit)
(wf_rf_best <- finalize_workflow(all_wf, rf_fit_best))
asked Feb 23 '20 by OTA



1 Answer

I started with your last block of code and made some edits to get a functional workflow. I have answered your questions alongside the code, and I have taken the liberty of reformatting your code and adding some advice.

## Packages, seed and data
library(tidyverse)
library(tidymodels)

set.seed(123)

df <- read_csv("creditcard.csv")

df <- 
  df %>% 
  arrange(-Class) %>% 
  head(1000) %>% 
  mutate(Class = as_factor(Class))


## Modeling

# Initial split
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)

You can see that df_split returns <750/250/1000> (see below).

2) To tune the xgboost model instead, there is very little to change.

# Models

model_rf <- 
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

model_xgboost <- 
  boost_tree(mtry = tune(), trees = tune(), min_n = tune()) %>% 
  set_engine("xgboost", importance = "impurity") %>% 
  set_mode("classification")

Here you choose your hyperparameter grid. I advise you to use a non-random grid to explore the space of hyperparameters in an optimal way.

# Grid of hyperparameters

grid_rf <- 
  grid_max_entropy(        
    mtry(range = c(1, 20)), 
    trees(range = c(500, 1000)),
    min_n(range = c(2, 10)),
    size = 30) 

These are your workflows; as you can see, there is virtually nothing to change.

# Workflow

wkfl_rf <- 
  workflow() %>% 
  add_formula(Class ~ .) %>% 
  add_model(model_rf)

wkfl_xgboost <- 
  workflow() %>% 
  add_formula(Class ~ .) %>% 
  add_model(model_xgboost)

1) <600/150/750> means that you have 600 observations in your training set, 150 in your validation set and a total of 750 observation in the original dataset. Plese note that, here, 600 + 150 = 750 but this is not always the case (e.g. with boostrap methods with resampling).

# Cross validation method

cv_folds <- vfold_cv(df_train, v = 5)
cv_folds

3) Here you choose which metrics you want to collect during tuning, with the yardstick package.

# Choose metrics

my_metrics <- metric_set(roc_auc, accuracy, sens, spec, precision, recall)

Then you can fit the different models across the grid. For the control options, I would not save the predictions but would print progress (imho).

# Tuning

rf_fit <- tune_grid(
  wkfl_rf,
  resamples = cv_folds,
  grid = grid_rf,
  metrics = my_metrics,
  control = control_grid(verbose = TRUE) # print progress; predictions not saved (imho)
)

These are some useful functions for working with the rf_fit object.

# Inspect tuning 

rf_fit
collect_metrics(rf_fit)
autoplot(rf_fit, metric = "accuracy")
show_best(rf_fit, metric = "accuracy")
select_best(rf_fit, metric = "accuracy")

Finally, you can fit your model with the best parameters.

# Fit best model 

tuned_model <-
  wkfl_rf %>% 
  finalize_workflow(select_best(rf_fit, metric = "accuracy")) %>% 
  fit(data = df_train)

predict(tuned_model, df_train)
predict(tuned_model, df_test)
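
To get test-set metrics as well (your question 2), one option, not shown above, is to bind the class and probability predictions to the true classes and apply the same metric set. This is only a sketch built on the objects defined above:

# Evaluate the tuned model on the held-out test set (sketch, my addition)
test_results <- 
  df_test %>% 
  select(Class) %>% 
  bind_cols(
    predict(tuned_model, df_test),                # class predictions (.pred_class)
    predict(tuned_model, df_test, type = "prob")  # class probabilities (.pred_0, .pred_1)
  )

test_results %>% 
  my_metrics(truth = Class, estimate = .pred_class, .pred_0) # .pred_0 = first factor level, yardstick's default event level

Alternatively, last_fit() on the finalized workflow and df_split, followed by collect_metrics(), fits on the training set and evaluates on the test set in one step.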

4) Unfortunately, methods for dealing with randomForest objects are usually not available for parsnip outputs.
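
That said, one possible workaround (my suggestion, not part of the answer above) is to pull the underlying ranger object out of the fitted workflow and hand that to vip, which can read ranger's impurity importances. A rough sketch, assuming the tuned_model object from above:

# Extract the raw ranger fit from the workflow and plot importance (sketch)
library(vip)

ranger_obj <- extract_fit_engine(tuned_model) # or pull_workflow_fit(tuned_model)$fit on older workflows versions
vip(ranger_obj, geom = "point")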

5) You can have a look at the vignette about parallelization.
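
As an illustration (my addition; details beyond the vignette are assumptions), tune_grid() will evaluate the resamples in parallel if a foreach backend such as doParallel is registered before the call:

# Register a parallel backend before tuning (sketch)
library(doParallel)

cl <- makePSOCKcluster(parallel::detectCores() - 1) # leave one core free
registerDoParallel(cl)

rf_fit <- tune_grid(
  wkfl_rf,
  resamples = cv_folds,
  grid = grid_rf,
  metrics = my_metrics,
  control = control_grid(verbose = TRUE)
)

stopCluster(cl)

Even in parallel, tuning 30 random forests with 5-fold cross-validation on the full dataset will take a while, so reducing the grid size or the number of trees is another lever to pull.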

answered Sep 30 '22 by abichat