I am comparing two alternative strategies to estimate linear regression models on subsets of data using the data.table
package for R
. The two strategies produce the same coefficients, so they appear equivalent. This appearance is deceiving. My question is:
Why is the data stored inside the
lm
models different?
library(data.table)
dat = data.table(mtcars)
# strategy 1
mod1 = dat[, .(models = .(lm(hp ~ mpg, data = .SD))), by = vs]
# strategy 2
mod2 = dat[, .(data = .(.SD)), by = vs][
, models := lapply(data, function(x) lm(hp ~ mpg, x))]
At first glance, the two approaches seem to produce identical results:
# strategy 1
coef(mod1$models[[1]])
#> (Intercept) mpg
#> 357.97866 -10.12576
# strategy 2
coef(mod2$models[[1]])
#> (Intercept) mpg
#> 357.97866 -10.12576
However, if I try to extract data from the (expanded) model.frame, I get different results:
# strategy 1
expanded_frame1 = expand.model.frame(mod1$models[[1]], "am")
table(expanded_frame1$am)
#>
#> 0 1
#> 7 11
# strategy 2
expanded_frame2 = expand.model.frame(mod2$models[[1]], "am")
table(expanded_frame2$am)
#>
#> 0 1
#> 12 6
This is a trivial minimal working example. My real use-case is that I obtained radically different results when applying sandwich::vcovCL
to computed clustered standard errors for my models.
I'm accepting the answer by @TimTeaFan (excellent detective work!) but adding a bit of useful info here for future readers.
As @achim-zeileis pointed out elsewhere, we can replicate a similar behavior in the global environment:
d <- subset(mtcars, subst = vs == 0)
m0 <- lm(hp ~ mpg, data = d)
d <- mtcars[0, ]
expand.model.frame(m0, "am")
[1] hp mpg am
<0 rows> (or 0-length row.names)
This does not appear to be a data.table
-specific issue. And in general, we have to be careful when re-evaluating the data from a model.
I don't have a complete answer, but I was able to pinpoint the problem to some extent.
When we compare the output of the two models, we can see that the result is equal except for the calls, which are different (which makes sense, since they actually are different):
# compare models
purrr::map2(mod1$models[[1]], mod2$models[[1]], all.equal)
#> $coefficients
#> [1] TRUE
#>
#> $residuals
#> [1] TRUE
#>
#> $effects
#> [1] TRUE
#>
#> $rank
#> [1] TRUE
#>
#> $fitted.values
#> [1] TRUE
#>
#> $assign
#> [1] TRUE
#>
#> $qr
#> [1] TRUE
#>
#> $df.residual
#> [1] TRUE
#>
#> $xlevels
#> [1] TRUE
#>
#> $call
#> [1] "target, current do not match when deparsed"
#>
#> $terms
#> [1] TRUE
#>
#> $model
#> [1] TRUE
So it seems that the initial call is working correctly with both approaches, the problem arises once we try to access the underlying data.
If we have a look at how expand.model.frame
gets its data, we can see that it calls eval(model$call$data, envir)
where envir
is defined as environment(formula(model))
the environment associated with the formula of the lm
object.
If we have a look at the data in the associated environment of each model and compare it with the data we expect it to hold, we can see that the second approach yields the data we expect, while the first approach using .SD
in the call yields some different data.
It is still not clear to me, why and what is happening, but we now know the problem is in the call to .SD
. I first thought, it might be caused by naming a data.table
.SD
, but after playing around with models where the data is a data.table
called .SD
this does not seem to be the issue.
# data of model 2 (identical to subsetted mtcars)
environment(formula(mod2$models[[1]]))$x[order(mpg),]
#> mpg cyl disp hp drat wt qsec am gear carb
#> 1: 10.4 8 472.0 205 2.93 5.250 17.98 0 3 4
#> 2: 10.4 8 460.0 215 3.00 5.424 17.82 0 3 4
#> 3: 13.3 8 350.0 245 3.73 3.840 15.41 0 3 4
#> 4: 14.3 8 360.0 245 3.21 3.570 15.84 0 3 4
#> 5: 14.7 8 440.0 230 3.23 5.345 17.42 0 3 4
#> 6: 15.0 8 301.0 335 3.54 3.570 14.60 1 5 8
#> 7: 15.2 8 275.8 180 3.07 3.780 18.00 0 3 3
#> 8: 15.2 8 304.0 150 3.15 3.435 17.30 0 3 2
#> 9: 15.5 8 318.0 150 2.76 3.520 16.87 0 3 2
#> 10: 15.8 8 351.0 264 4.22 3.170 14.50 1 5 4
#> 11: 16.4 8 275.8 180 3.07 4.070 17.40 0 3 3
#> 12: 17.3 8 275.8 180 3.07 3.730 17.60 0 3 3
#> 13: 18.7 8 360.0 175 3.15 3.440 17.02 0 3 2
#> 14: 19.2 8 400.0 175 3.08 3.845 17.05 0 3 2
#> 15: 19.7 6 145.0 175 3.62 2.770 15.50 1 5 6
#> 16: 21.0 6 160.0 110 3.90 2.620 16.46 1 4 4
#> 17: 21.0 6 160.0 110 3.90 2.875 17.02 1 4 4
#> 18: 26.0 4 120.3 91 4.43 2.140 16.70 1 5 2
# subset and order mtcars data
mtcars_vs0 <- subset(mtcars, vs == 0)
mtcars_vs0[order(mtcars_vs0$mpg), ]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# data of model 1 (not identical to mtcars)
environment(formula(mod1$models[[1]]))$.SD[order(mpg),]
#> mpg cyl disp hp drat wt qsec am gear carb
#> 1: 15.0 8 301.0 335 3.54 3.570 14.60 1 5 8
#> 2: 15.8 8 351.0 264 4.22 3.170 14.50 1 5 4
#> 3: 17.8 6 167.6 123 3.92 3.440 18.90 0 4 4
#> 4: 18.1 6 225.0 105 2.76 3.460 20.22 0 3 1
#> 5: 19.2 6 167.6 123 3.92 3.440 18.30 0 4 4
#> 6: 19.7 6 145.0 175 3.62 2.770 15.50 1 5 6
#> 7: 21.4 6 258.0 110 3.08 3.215 19.44 0 3 1
#> 8: 21.4 4 121.0 109 4.11 2.780 18.60 1 4 2
#> 9: 21.5 4 120.1 97 3.70 2.465 20.01 0 3 1
#> 10: 22.8 4 108.0 93 3.85 2.320 18.61 1 4 1
#> 11: 22.8 4 140.8 95 3.92 3.150 22.90 0 4 2
#> 12: 24.4 4 146.7 62 3.69 3.190 20.00 0 4 2
#> 13: 26.0 4 120.3 91 4.43 2.140 16.70 1 5 2
#> 14: 27.3 4 79.0 66 4.08 1.935 18.90 1 4 1
#> 15: 30.4 4 75.7 52 4.93 1.615 18.52 1 4 2
#> 16: 30.4 4 95.1 113 3.77 1.513 16.90 1 5 2
#> 17: 32.4 4 78.7 66 4.08 2.200 19.47 1 4 1
#> 18: 33.9 4 71.1 65 4.22 1.835 19.90 1 4 1
I tried digging a little deeper to see whats going on. First I called debug(as.formula)
and then looked at the following objects in each iteration:
object
ls(environment(object))
We can see that in "strategy 2" each formula is associated with a different environment, and when looking at the environment we see it contains one object x
, which when inspected (environment(object)$x
) contains the expected mtcars
data.
In "strategy 1" however, we can observe that each call to as.formula
associates the same environment with the formula being created. Further, when inspecting the environment we can see that it is populated with the single vectors of the subsetted mtcars
data (e.g. am
, carb
, cyl
etc.) as well as some functions (e.g. .POSIXt
, Cfastmean
, strptime
etc.). This is probably where things go awry. I would suspect that when associating the same environment with two different formulas (models), the first models underlying data gets "updated" when the second model is calculated. This should also be the reason why the model output itself is correct. To the time the first model is being calculated, the data is still correct. It is overwritten by the second model, which therefore is correct, too. But when accessing the underlying data afterwards things get messy.
Side note
I was curious if we can observe similar problems and differences in the tidyverse when using expand.model.frame
and the answer is "yes". Here, the new rowwise
notation throws an error, while the group_map
as well as the map
approach work:
# dplyr approaches:
# group_map: works
mod3 <- mtcars %>%
group_by(vs) %>%
group_map(~ lm(hp ~ mpg, data = .x))
expand.model.frame(mod3[[1]], "am")
# mutate / rowwise: does not work
mod4 <- mtcars %>%
nest_by(vs) %>%
mutate(models = list(lm(hp ~ mpg, data = data)))
expand.model.frame(mod4$models[[1]], "am")
# mutate / map: works
mod5 <- mtcars %>%
tidyr::nest(data = !vs) %>%
mutate(models = purrr::map(data, ~ lm(hp ~ mpg, data = .x)))
expand.model.frame(mod5$models[[1]], "am")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With