Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

cost function in cv.glm of boot library in R

I am trying to use the crossvalidation cv.glm function from the boot library in R to determine the number of misclassifications when a glm logistic regression is applied.

The function has the following signature:

cv.glm(data, glmfit, cost, K)

with the first two denoting the data and model and K specifies the k-fold. My problem is the cost parameter which is defined as:

cost: A function of two vector arguments specifying the cost function for the crossvalidation. The first argument to cost should correspond to the observed responses and the second argument should correspond to the predicted or fitted responses from the generalized linear model. cost must return a non-negative scalar value. The default is the average squared error function.

I guess for classification it would make sense to have a function which returns the rate of misclassification something like:

nrow(subset(data, (predict >= 0.5  & data$response == "no") | 
                  (predict <  0.5  & data$response == "yes")))

which is of course not even syntactically correct.

Unfortunately, my limited R knowledge let me waste hours and I was wondering if someone could point me in the correct direction.

like image 748
user695652 Avatar asked May 27 '13 22:05

user695652


1 Answers

It sounds like you might do well to just use the cost function (i.e. the one named cost) defined further down in the "Examples" section of ?cv.glm. Quoting from that section:

 # [...] Since the response is a binary variable an
 # appropriate cost function is
 cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)

This does essentially what you were trying to do with your example. Replacing your "no" and "yes" with 0 and 1, lets say you have two vectors, predict and response. Then cost() is nicely designed to take them and return the mean classification rate:

## Simulate some reasonable data
set.seed(1)
predict <- seq(0.1, 0.9, by=0.1)
response <-  rbinom(n=length(predict), prob=predict, size=1)
response
# [1] 0 0 0 1 0 0 0 1 1

## Demonstrate the function 'cost()' in action
cost(response, predict)
# [1] 0.3333333  ## Which is right, as 3/9 elements (4, 6, & 7) are misclassified
                 ## (assuming you use 0.5 as the cutoff for your predictions).

I'm guessing the trickiest bit of this will be just getting your mind fully wrapped around the idea of passing a function in as an argument. (At least that was for me, for the longest time, the hardest part of using the boot package, which requires that move in a fair number of places.)


Added on 2016-03-22:

The function cost(), given above is in my opinion unnecessarily obfuscated; the following alternative does exactly the same thing but in a more expressive way:

cost <- function(r, pi = 0) { 
        mean((pi < 0.5) & r==1 | (pi > 0.5) & r==0)
}
like image 122
Josh O'Brien Avatar answered Oct 15 '22 17:10

Josh O'Brien