
Using Cost Sensitive C50 in caret

Tags: r, r-caret

I am using train from the caret package to train some C5.0 models. The C5.0 method works fine, but with the cost-sensitive C5.0 method I struggle to understand how to tune the cost parameter. What I am trying to do is introduce a cost for wrongly predicting one of my classes. I've searched the caret package website (http://topepo.github.io/caret/index.html) and read several manuals and tutorials found here and there, but I didn't find any information about how to handle the cost parameter. So this is what I tried on my own:

  1. Run train with the default settings to see what I get. In the output, train tried cost values from 0 to 2 and reported the best model at cost=2.

  2. Add the cost to expand.grid as a matrix, the same way you would when calling C5.0 from the C50 package directly. The code is below (trials is fixed at 1 because I just want one tree/set of rules in my output):

    c50Grid <- expand.grid(.trials=1, .model=c("tree", "rules"), .winnow=c("TRUE", "FALSE"), .cost=matrix(c(0,1,2,0), ncol=2))

However, when I execute train, I don't get any errors (though I do get 50 warnings), and train again tries cost values from 0 to 2. What am I doing wrong? What format does the cost parameter take? What does it mean here, and how would I interpret the results? Which class is the one that gets the cost, as in "Predicting class 0 wrong costs double than class 1"? Also, I only tried a single matrix, and that format didn't work; how would I supply the different costs that I want to test?

Thanks! Any help would be really welcome!


Edit:

So, trying to find out on my own what the cost parameter means for C5.0Cost, I went to C5.0Cost.R (https://r-forge.r-project.org/scm/viewvc.php/models/files/C5.0Cost.R?view=markup&root=caret&pathrev=761) and read the code. This line:

cmat <-matrix(c(0, param$cost, 1, 0), ncol = 2)

I guess this passes the cost parameter into the cost matrix, so I think I now understand how it works. If I have class = {0,1} and my positive class is 0, this matrix says that "Predicting class 0 wrong costs double than class 1", right? My question now is, how could I do the opposite? How could I set that "Predicting class 1 wrong costs double than class 0", which would be:

cmat <- matrix(c(0, 1, param$cost, 0), ncol=2)

Could I just set the cost to 0.5? And if I want to train with different values, just use values less than 1 {0.5, 0.6, 0.7, etc.}? Note: with my data, C50 and the other tree models I used before take "Positive class = 0", so I had to invert the cost matrix when I used C50. If I use the caret method C5.0Cost, I'd need to do the same or find another way to do it.
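To make the two layouts concrete, here is a minimal base R sketch. The value of cost and the names cmat_caret, cmat_flipped and cmat_half are made up for illustration. It only shows where matrix() puts the numbers and that plugging 0.5 into the first layout gives the same ratio between the two off-diagonal costs as the flipped matrix. It does not show how C5.0 treats a rescaled matrix, and which of rows/columns is "predicted" versus "observed" is something to check in ?C5.0.

```r
# Hypothetical cost value, for illustration only
cost <- 2

# Layout from the line quoted from C5.0Cost.R:
# matrix() fills column by column, so `cost` lands in row 2, column 1
cmat_caret <- matrix(c(0, cost, 1, 0), ncol = 2)

# The "opposite" layout: `cost` lands in row 1, column 2
cmat_flipped <- matrix(c(0, 1, cost, 0), ncol = 2)

# Plugging 1 / cost (here 0.5) into the first layout
cmat_half <- matrix(c(0, 1 / cost, 1, 0), ncol = 2)

# Ratio between the two off-diagonal costs in each matrix
ratio_flipped <- cmat_flipped[1, 2] / cmat_flipped[2, 1]
ratio_half <- cmat_half[1, 2] / cmat_half[2, 1]

print(cmat_caret)
print(cmat_flipped)
print(cmat_half)
```

Here cmat_half equals cmat_flipped divided by cost, so the relative penalty is the same even though the absolute values differ.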

I'd really appreciate any help here. Thanks!

Fabiola Fernández asked Oct 01 '14 11:10


2 Answers

There is cost-sensitive model code for train and C5.0 (use method = "C5.0Cost"). For example:

library(caret)

set.seed(1)
dat1 <- twoClassSim(1000, intercept = -12)
dat2 <- twoClassSim(1000, intercept = -12)

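# Custom summary function: accuracy and kappa from postResample(),
# plus sensitivity and specificity of the resampled predictions
stats <- function (data, lev = NULL, model = NULL)  {
  c(postResample(data[, "pred"], data[, "obs"]),
    Sens = sensitivity(data[, "pred"], data[, "obs"]),
    Spec = specificity(data[, "pred"], data[, "obs"]))
}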

ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     summaryFunction = stats)

set.seed(2)
mod1 <- train(Class ~ ., data = dat1, 
              method = "C5.0",
              tuneGrid = expand.grid(model = "tree", winnow = FALSE,
                                     trials = c(1:10, (1:5)*10)),
              trControl = ctrl)

xyplot(Sens + Spec ~ trials, data = mod1$results, 
       type = "l",
       auto.key = list(columns = 2, 
                       lines = TRUE, 
                       points = FALSE))

# Same setup with the cost-sensitive method; cost is tuned as a plain numeric vector
set.seed(2)
mod2 <- train(Class ~ ., data = dat1, 
              method = "C5.0Cost",
              tuneGrid = expand.grid(model = "tree", winnow = FALSE,
                                     trials = c(1:10, (1:5)*10),
                                     cost = 1:10),
              trControl = ctrl)

xyplot(Sens + Spec ~ trials|format(cost), data = mod2$results, 
       type = "l",
       auto.key = list(columns = 2, 
                       lines = TRUE, 
                       points = FALSE))
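As a rough sketch of what the tuning grid for mod2 looks like on its own (base R only; the names grid, n_combos and cost_values are just for illustration): each tuning parameter is an ordinary vector, and every row of the resulting data frame carries a single scalar cost rather than a matrix.

```r
# Same grid as the one passed to mod2, built without calling train()
grid <- expand.grid(model = "tree", winnow = FALSE,
                    trials = c(1:10, (1:5) * 10),
                    cost = 1:10)

# One row per candidate model; note that expand.grid() does not
# drop repeated values within a vector
n_combos <- nrow(grid)

# The distinct scalar costs that get tried
cost_values <- sort(unique(grid$cost))

str(grid)
print(n_combos)
print(cost_values)
```

To test other costs, change the vector given to cost; a matrix is not needed in the grid.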

Max

topepo answered Nov 10 '22 07:11


If I have class = {0,1} and my positive class is 0, this matrix says that "Predicting class 0 wrong costs double than class 1", right? My question now is, how could I do the opposite? How could I set that "Predicting class 1 wrong costs double than class 0" [...]?

Unfortunately, you can't change the costs for the false positives in caret at the moment. This appears to be a bug! See this post for further information about this issue.

JimBoy answered Nov 10 '22 08:11