I'm trying to fit some GLMs inside a data.table to produce modelled results split by key factors.
I've been doing this successfully for:
High level glm
glm(modellingDF,formula=Outcome~IntCol + DecCol,family=binomial(link=logit))
Scoped glm with single columns
modellingDF[,list(Outcome, fitted=glm(x,formula=Outcome~IntCol ,family=binomial(link=logit))$fitted ), by=variable]
Scoped glm with two integer columns
modellingDF[,list(Outcome, fitted=glm(x,formula=Outcome~IntCol + IntCol2 ,family=binomial(link=logit))$fitted ), by=variable]
But when I try to run the high level glm inside the scope with my decimal column, it produces this error:
Error in model.frame.default(formula = Outcome ~ IntCol + DecCol, data = x, :
variable lengths differ (found for 'DecCol')
I thought perhaps it was due to variable lengths of the partitions, so I tested with a reproducible example:
library("data.table")
testing<-data.table(letters=sample(rep(LETTERS,5000),5000),
letters2=sample(rep(LETTERS[1:5],10000),5000),
cont.var=rnorm(5000),
cont.var2=round(rnorm(5000)*1000,0),
outcome=rbinom(5000,1,0.8)
,key="letters")
testing.glm<-testing[,list(outcome,
fitted=glm(x,formula=outcome~cont.var+cont.var2,family=binomial(link=logit))$fitted
),by=list(letters)]
But this did not produce the error. I thought it might be due to NAs, but a summary of the data.table modellingDF gives no indication of any issues:
DecCol
Min. :0.0416
1st Qu.:0.6122
Median :0.7220
Mean :0.6794
3rd Qu.:0.7840
Max. :0.9495
nrow(modellingDF[is.na(DecCol),]) # results in 0
modellingDF[,list(len=.N,DecCollen=length(DecCol),IntCollen=length(IntCol),Outcomelen=length(Outcome)),by=Bracket]
Bracket len DecCollen IntCollen Outcomelen
1: 3-6 39184 39184 39184 39184
2: 1-2 19909 19909 19909 19909
3: 0 9912 9912 9912 9912
Perhaps I'm having a dozy day, but could anyone suggest a solution or a means for digging into this issue further?
You need to correctly specify the data argument within glm. Inside a data.table (using [), this is referenced by .SD. (See "create a formula in a data.table environment in R" for a related question.)
So
modellingDF[,list(Outcome, fitted = glm(data = .SD,
formula = Outcome ~ IntCol ,family = binomial(link = logit))$fitted),
by=variable]
will work.
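As a rough check, the same fix can be tried on data shaped like the question's test set. This is only a sketch: the seed, the smaller sample size, and the five-letter grouping are assumptions added here so it runs quickly and repeatably, and fitted() is used in place of $fitted purely for readability.

```r
library(data.table)

set.seed(42)  # assumption: seed added so the sketch is repeatable

# Data shaped like the question's example, but smaller (assumed sizes)
testing <- data.table(
  letters   = sample(LETTERS[1:5], 500, replace = TRUE),
  cont.var  = rnorm(500),
  cont.var2 = round(rnorm(500) * 1000, 0),
  outcome   = rbinom(500, 1, 0.8),
  key = "letters"
)

# data = .SD hands glm the rows of the current group, so every variable
# in the formula is looked up in the same subset of the table
testing.glm <- testing[,
  list(outcome,
       fitted = fitted(glm(outcome ~ cont.var + cont.var2,
                           data = .SD,
                           family = binomial(link = logit)))),
  by = "letters"]
```

Each group contributes one fitted value per row, so testing.glm should have the same number of rows as testing.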
While this approach is sound in this case (simply extracting the fitted values and moving on), using data.table and .SD can get into a mess of environments if you save the whole model and then attempt to update it (see "Why is using update on a lm inside a grouped data.table losing its model data?").
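If the whole model does need to be kept and refitted, one possible workaround (not taken from the linked question, just a sketch) is to split the table first and fit outside of [, then pass the data to update explicitly. The table dt and the columns grp, x1, x2 and y below are hypothetical names made up for illustration.

```r
library(data.table)

set.seed(1)  # assumption: seed and data are illustrative only

# Hypothetical example table with two groups
dt <- data.table(
  grp = rep(c("a", "b"), each = 100),
  x1  = rnorm(200),
  x2  = rnorm(200),
  y   = rbinom(200, 1, 0.7)
)

# One data.table per group, created outside of `[`
groups <- split(dt, by = "grp")

# Fit and keep the whole model for each group
models <- lapply(groups, function(d) glm(y ~ x1, data = d, family = binomial))

# Refit with an extra term, supplying the data explicitly so update()
# does not have to re-evaluate the original `data` argument
models2 <- lapply(names(groups), function(g) {
  update(models[[g]], . ~ . + x2, data = groups[[g]])
})
names(models2) <- names(groups)
```

The trade-off is that the fitting no longer happens inside a single grouped data.table call, but each stored model can be refitted against data that is still in scope.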