Ok, this is a weird one. I suspect this is a bug inside data.table
, but it would be useful if anyone can explain why this is happening - what is update
doing exactly?
I'm using the list(list())
trick inside data.table
to store fitted models. When you create a sequence of lm
objects each for different groupings, and then update
those models, the model data for all models becomes that of the last grouping. This seems like a reference is hanging around somewhere where a copy should have been made, but I can't find where and I can't reproduce this outside of lm
and update
.
Concrete example:
Starting with the iris data, first make the three species different sample sizes, then fit an lm
model to each species, the update those models:
set.seed(3)
DT = data.table(iris)
DT = DT[rnorm(150) < 0.9]
fit = DT[, list(list(lm(Sepal.Length ~ Sepal.Width + Petal.Length))),
by = Species]
fit2 = fit[, list(list(update(V1[[1]], ~.-Sepal.Length))), by = Species]
The original data table has different numbers of each species
DT[,.N, by = Species]
# Species N
# 1: setosa 41
# 2: versicolor 39
# 3: virginica 42
And the first fit confirms thsi:
fit[, nobs(V1[[1]]), by = Species]
# Species V1
# 1: setosa 41
# 2: versicolor 39
# 3: virginica 42
But the updated second fit is showing 42 for all models
fit2[, nobs(V1[[1]]), by = Species]
# Species V1
# 1: setosa 42
# 2: versicolor 42
# 3: virginica 42
We can also look at the model attribute which contains the data used for fitting, and see that all the model are indeed using the final groups data. The question is how has this happened?
head(fit$V1[[1]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 5.1 3.5 1.4
# 2 4.9 3.0 1.4
# 3 4.7 3.2 1.3
# 4 4.6 3.1 1.5
# 5 5.0 3.6 1.4
# 6 5.4 3.9 1.7
head(fit$V1[[3]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
head(fit2$V1[[1]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
head(fit2$V1[[3]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
This is not an answer, but is too long for a comment
The .Environment
for the terms component is identical for each resulting model
e1 <- attr(fit[['V1']][[1]]$terms, '.Environment')
e2 <- attr(fit[['V1']][[2]]$terms, '.Environment')
e3 <- attr(fit[['V1']][[3]]$terms, '.Environment')
identical(e1,e2)
## TRUE
identical(e2, e3)
## TRUE
It appears that data.table
is using the same bit of memory (my non-technical term) for
each evaluation of j
by group (which is efficient). However when update
is called, it is using this to refit the model. This will contain the values from the last group.
So, if you fudge this, it will work
fit = DT[, { xx <-list2env(copy(.SD))
mymodel <-lm(Sepal.Length ~ Sepal.Width + Petal.Length)
attr(mymodel$terms, '.Environment') <- xx
list(list(mymodel))}, by= 'Species']
lfit2 <- fit[, list(list(update(V1[[1]], ~.-Sepal.Width))), by = Species]
lfit2[,lapply(V1,nobs)]
V1 V2 V3
1: 41 39 42
# using your exact diagnostic coding.
lfit2[,nobs(V1[[1]]),by = Species]
Species V1
1: setosa 41
2: versicolor 39
3: virginica 42
not a long term solution, but at least a workaround.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With