 

R - using glm inside a data.table

Tags:

r

data.table

glm

I'm trying to fit some glms inside a data.table to produce modelled results split by key factors.

I've been doing this successfully for:

  • High level glm

    glm(modellingDF,formula=Outcome~IntCol + DecCol,family=binomial(link=logit))

  • Scoped glm with single columns

    modellingDF[,list(Outcome, fitted=glm(x,formula=Outcome~IntCol ,family=binomial(link=logit))$fitted ), by=variable]

  • Scoped glm with two integer columns

    modellingDF[,list(Outcome, fitted=glm(x,formula=Outcome~IntCol + IntCol2 ,family=binomial(link=logit))$fitted ), by=variable]

But when I try to run the high level glm inside the scope with my decimal column, it produces this error:

Error in model.frame.default(formula = Outcome ~ IntCol + DecCol, data = x,  : 
  variable lengths differ (found for 'DecCol')

I thought perhaps it was due to variable lengths of the partitions, so I tested with a reproducible example:

library("data.table")

testing<-data.table(letters=sample(rep(LETTERS,5000),5000),
                    letters2=sample(rep(LETTERS[1:5],10000),5000), 
                    cont.var=rnorm(5000),
                    cont.var2=round(rnorm(5000)*1000,0),
                    outcome=rbinom(5000,1,0.8)
                    ,key="letters")
testing.glm<-testing[,list(outcome,
                  fitted=glm(x,formula=outcome~cont.var+cont.var2,family=binomial(link=logit))$fitted
        ),by=list(letters)]

But this did not produce the error. I thought it might be due to NAs, but a summary of the data.table modellingDF gives no indication of any issues:

DecCol
Min.   :0.0416
1st Qu.:0.6122
Median :0.7220
Mean   :0.6794
3rd Qu.:0.7840
Max.   :0.9495

nrow(modellingDF[is.na(DecCol),])   # results in 0

modellingDF[,list(len=.N,DecCollen=length(DecCol),IntCollen=length(IntCol),
                  Outcomelen=length(Outcome)),by=Bracket]

   Bracket   len DecCollen IntCollen Outcomelen
1:     3-6 39184     39184     39184      39184
2:     1-2 19909     19909     19909      19909
3:       0  9912      9912      9912       9912

Perhaps I'm having a dozy day, but could anyone suggest a solution or a means for digging into this issue further?

Steph Locke asked Sep 25 '13 09:09



1 Answer

You need to specify the data argument to glm correctly. Inside a data.table call (that is, within [), the data for the current group is available as .SD. (See "create a formula in a data.table environment in R" for a related question.)

So

modellingDF[,list(Outcome, fitted = glm(data = .SD, 
  formula = Outcome ~ IntCol ,family = binomial(link = logit))$fitted),
 by=variable]

will work.
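As a rough sketch, the same fix can be applied to the reproducible example from the question, this time with both continuous predictors. The fixed seed and the use of by = "letters" (a character column name) are my own choices for illustration. The formula is passed first and data = .SD is named explicitly, so nothing is matched by position.

    library(data.table)

    set.seed(42)  # assumption: fixed seed, only so the sketch is repeatable
    testing <- data.table(letters   = sample(rep(LETTERS, 5000), 5000),
                          cont.var  = rnorm(5000),
                          cont.var2 = round(rnorm(5000) * 1000, 0),
                          outcome   = rbinom(5000, 1, 0.8),
                          key = "letters")

    # .SD holds the rows of the current group (all columns except the by column),
    # so each glm is fitted on that group's data only
    testing.glm <- testing[, list(outcome,
                                  fitted = glm(outcome ~ cont.var + cont.var2,
                                               data = .SD,
                                               family = binomial(link = logit))$fitted.values),
                           by = "letters"]

The result should have one row per input row, with a fitted probability next to each outcome.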

This approach is sound here, where you simply extract the fitted values and move on. If you instead save the whole model and later try to update it, data.table and .SD can leave you in a mess of environments (see "Why is using update on a lm inside a grouped data.table losing its model data?").
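If you do need to keep and update whole models, one possible workaround (not from the linked question, just a sketch under my own assumptions) is to fit each group on an ordinary data object and call update() in the same function, where that object is still in scope. The split(..., by = ) form assumes a reasonably recent data.table version; the names d, full and reduced are made up for illustration.

    library(data.table)

    set.seed(42)  # assumption: fixed seed, only so the sketch is repeatable
    testing <- data.table(letters   = sample(rep(LETTERS, 5000), 5000),
                          cont.var  = rnorm(5000),
                          cont.var2 = round(rnorm(5000) * 1000, 0),
                          outcome   = rbinom(5000, 1, 0.8),
                          key = "letters")

    # One data.table per letter; each model's call refers to `d`, not to .SD
    models <- lapply(split(testing, by = "letters"), function(d) {
      full <- glm(outcome ~ cont.var + cont.var2, data = d,
                  family = binomial(link = logit))
      # update() re-evaluates the stored call in this function's frame,
      # where `d` still exists
      reduced <- update(full, . ~ . - cont.var2)
      list(full = full, reduced = reduced)
    })

Calling update() later from a different environment would still need the data object to be visible there, so this only moves the problem to a place where it is easier to control.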

mnel answered Sep 21 '22 04:09
