
Cross Validation function for logistic regression in R

I come from a predominantly Python + scikit-learn background, and I was wondering how one would obtain the cross-validation accuracy for a logistic regression model in R. I searched around and was surprised that there seems to be no easy way to do this. I'm looking for the equivalent of:

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Assume a pandas DataFrame `dataset` and a target vector `target` already exist.

scores = cross_val_score(LogisticRegression(), dataset, target, cv=10)
print(scores)

In R, I have:

model <- glm(Y ~ X, family = binomial, data = df)
summary(model)

And now I'm stuck. The reason is that the deviance for my R model is 1900, implying it's a bad fit, but the Python one gives me 85% 10-fold cross-validation accuracy, which suggests it's good. That seems a bit strange, so I wanted to run cross-validation in R to see whether I get the same result.

Any help is appreciated!

asked Dec 10 '22 by John Bennet


2 Answers

An R version using the caret package:

library(caret)

# define the training control: 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)

# train the model on the training set
model <- train(target ~ .,
               data      = train,
               trControl = train_control,
               method    = "glm",
               family    = binomial())

# print the cross-validated scores (resampling results)
print(model)
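If you want something self-contained to try, below is a minimal sketch of the same idea using mtcars as a stand-in dataset (the dataset and column choices are purely illustrative). When the target is a factor, caret reports the cross-validated classification accuracy:

library(caret)

# stand-in data: predict transmission type (am) from a few mtcars columns
df    <- mtcars
df$am <- factor(df$am, labels = c("automatic", "manual"))  # caret expects a factor target

train_control <- trainControl(method = "cv", number = 10)

cv_model <- train(am ~ mpg + wt + hp,
                  data      = df,
                  trControl = train_control,
                  method    = "glm",
                  family    = binomial())

print(cv_model)      # mean CV accuracy and Kappa over the 10 folds
cv_model$resample    # per-fold accuracy, analogous to sklearn's cross_val_score output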
answered Jan 04 '23 by Sandipan Dey


Below I took an answer from here and made a few changes.

The changes I made were to make it a logit (logistic) model, add modeling and prediction, store the CV results, and make it a fully working example.

Also note that there are many packages and functions you could use, including cv.glm() from boot.
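For example, a minimal cv.glm() sketch could look like the following (mtcars is used purely as a stand-in dataset here; cv.glm() reports a prediction error estimate, so the accuracy is one minus that):

library(boot)

# stand-in data: binary target am, two predictors
fit <- glm(am ~ mpg + wt, family = binomial, data = mtcars)

# cost function: misclassification rate at a 0.5 probability threshold
cost <- function(y, prob) mean(abs(y - prob) > 0.5)

cv_err <- cv.glm(mtcars, fit, cost = cost, K = 10)
1 - cv_err$delta[1]   # 10-fold cross-validated accuracy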

data(ChickWeight)

# build a binary outcome Y and a predictor X from the ChickWeight data
df                    <- ChickWeight
df$Y                  <- 0
df$Y[df$weight > 100] <- 1
df$X                  <- df$Diet

# shuffle the rows and assign each one to one of 10 folds
df     <- df[sample(nrow(df)), ]
folds  <- cut(seq(1, nrow(df)), breaks = 10, labels = FALSE)
result <- list()

for (i in 1:10) {
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData    <- df[testIndexes, ]
  trainData   <- df[-testIndexes, ]
  model       <- glm(Y ~ X, family = binomial, data = trainData)
  # type = "response" gives predicted probabilities rather than log-odds
  result[[i]] <- predict(model, testData, type = "response")
}
result

You could add a line to calculate accuracy within the loop or just do it after the loop completes.
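For instance, since the loop above stores predicted probabilities (type = "response"), a post-loop sketch with an assumed 0.5 classification threshold could be:

# compare each fold's predicted classes with the true labels
accuracy <- sapply(1:10, function(i) {
  testData <- df[which(folds == i), ]
  pred     <- ifelse(result[[i]] > 0.5, 1, 0)   # classify at a 0.5 threshold
  mean(pred == testData$Y)                      # fraction of correct predictions
})

mean(accuracy)   # overall 10-fold CV accuracy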

answered Jan 05 '23 by Hack-R