Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R data.table: Difference between nested regressions results

Tags:

r

data.table

I am comparing two alternative strategies to estimate linear regression models on subsets of data using the data.table package for R. The two strategies produce the same coefficients, so they appear equivalent. This appearance is deceiving. My question is:

Why is the data stored inside the lm models different?

library(data.table)

dat = data.table(mtcars)

# strategy 1
mod1 = dat[, .(models = .(lm(hp ~ mpg, data = .SD))), by = vs]

# strategy 2
mod2 = dat[, .(data = .(.SD)), by = vs][
           , models := lapply(data, function(x) lm(hp ~ mpg, x))]

At first glance, the two approaches seem to produce identical results:

# strategy 1
coef(mod1$models[[1]])
#> (Intercept)         mpg 
#>   357.97866   -10.12576

# strategy 2
coef(mod2$models[[1]])
#> (Intercept)         mpg 
#>   357.97866   -10.12576

However, if I try to extract data from the (expanded) model.frame, I get different results:

# strategy 1
expanded_frame1 = expand.model.frame(mod1$models[[1]], "am")
table(expanded_frame1$am)
#> 
#>  0  1 
#>  7 11

# strategy 2
expanded_frame2 = expand.model.frame(mod2$models[[1]], "am")
table(expanded_frame2$am)
#> 
#>  0  1 
#> 12  6

This is a trivial minimal working example. My real use-case is that I obtained radically different results when applying sandwich::vcovCL to computed clustered standard errors for my models.

Edit:

I'm accepting the answer by @TimTeaFan (excellent detective work!) but adding a bit of useful info here for future readers.

As @achim-zeileis pointed out elsewhere, we can replicate a similar behavior in the global environment:

d <- subset(mtcars, subst = vs == 0)
m0 <- lm(hp ~ mpg, data = d)
d <- mtcars[0, ]
expand.model.frame(m0, "am")

[1] hp  mpg am
<0 rows> (or 0-length row.names)

This does not appear to be a data.table-specific issue. And in general, we have to be careful when re-evaluating the data from a model.

like image 338
Vincent Avatar asked Jan 30 '21 18:01

Vincent


1 Answers

I don't have a complete answer, but I was able to pinpoint the problem to some extent.

When we compare the output of the two models, we can see that the result is equal except for the calls, which are different (which makes sense, since they actually are different):

# compare models
purrr::map2(mod1$models[[1]], mod2$models[[1]], all.equal)
#> $coefficients
#> [1] TRUE
#> 
#> $residuals
#> [1] TRUE
#> 
#> $effects
#> [1] TRUE
#> 
#> $rank
#> [1] TRUE
#> 
#> $fitted.values
#> [1] TRUE
#> 
#> $assign
#> [1] TRUE
#> 
#> $qr
#> [1] TRUE
#> 
#> $df.residual
#> [1] TRUE
#> 
#> $xlevels
#> [1] TRUE
#> 
#> $call
#> [1] "target, current do not match when deparsed"
#> 
#> $terms
#> [1] TRUE
#> 
#> $model
#> [1] TRUE

So it seems that the initial call is working correctly with both approaches, the problem arises once we try to access the underlying data.

If we have a look at how expand.model.frame gets its data, we can see that it calls eval(model$call$data, envir) where envir is defined as environment(formula(model)) the environment associated with the formula of the lm object.

If we have a look at the data in the associated environment of each model and compare it with the data we expect it to hold, we can see that the second approach yields the data we expect, while the first approach using .SD in the call yields some different data.

It is still not clear to me, why and what is happening, but we now know the problem is in the call to .SD. I first thought, it might be caused by naming a data.table .SD, but after playing around with models where the data is a data.table called .SD this does not seem to be the issue.

# data of model 2 (identical to subsetted mtcars)
environment(formula(mod2$models[[1]]))$x[order(mpg),]
#>      mpg cyl  disp  hp drat    wt  qsec am gear carb
#>  1: 10.4   8 472.0 205 2.93 5.250 17.98  0    3    4
#>  2: 10.4   8 460.0 215 3.00 5.424 17.82  0    3    4
#>  3: 13.3   8 350.0 245 3.73 3.840 15.41  0    3    4
#>  4: 14.3   8 360.0 245 3.21 3.570 15.84  0    3    4
#>  5: 14.7   8 440.0 230 3.23 5.345 17.42  0    3    4
#>  6: 15.0   8 301.0 335 3.54 3.570 14.60  1    5    8
#>  7: 15.2   8 275.8 180 3.07 3.780 18.00  0    3    3
#>  8: 15.2   8 304.0 150 3.15 3.435 17.30  0    3    2
#>  9: 15.5   8 318.0 150 2.76 3.520 16.87  0    3    2
#> 10: 15.8   8 351.0 264 4.22 3.170 14.50  1    5    4
#> 11: 16.4   8 275.8 180 3.07 4.070 17.40  0    3    3
#> 12: 17.3   8 275.8 180 3.07 3.730 17.60  0    3    3
#> 13: 18.7   8 360.0 175 3.15 3.440 17.02  0    3    2
#> 14: 19.2   8 400.0 175 3.08 3.845 17.05  0    3    2
#> 15: 19.7   6 145.0 175 3.62 2.770 15.50  1    5    6
#> 16: 21.0   6 160.0 110 3.90 2.620 16.46  1    4    4
#> 17: 21.0   6 160.0 110 3.90 2.875 17.02  1    4    4
#> 18: 26.0   4 120.3  91 4.43 2.140 16.70  1    5    2

# subset and order mtcars data
mtcars_vs0 <- subset(mtcars, vs == 0)
mtcars_vs0[order(mtcars_vs0$mpg), ]
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2

# data of model 1 (not identical to mtcars)
environment(formula(mod1$models[[1]]))$.SD[order(mpg),]
#>      mpg cyl  disp  hp drat    wt  qsec am gear carb
#>  1: 15.0   8 301.0 335 3.54 3.570 14.60  1    5    8
#>  2: 15.8   8 351.0 264 4.22 3.170 14.50  1    5    4
#>  3: 17.8   6 167.6 123 3.92 3.440 18.90  0    4    4
#>  4: 18.1   6 225.0 105 2.76 3.460 20.22  0    3    1
#>  5: 19.2   6 167.6 123 3.92 3.440 18.30  0    4    4
#>  6: 19.7   6 145.0 175 3.62 2.770 15.50  1    5    6
#>  7: 21.4   6 258.0 110 3.08 3.215 19.44  0    3    1
#>  8: 21.4   4 121.0 109 4.11 2.780 18.60  1    4    2
#>  9: 21.5   4 120.1  97 3.70 2.465 20.01  0    3    1
#> 10: 22.8   4 108.0  93 3.85 2.320 18.61  1    4    1
#> 11: 22.8   4 140.8  95 3.92 3.150 22.90  0    4    2
#> 12: 24.4   4 146.7  62 3.69 3.190 20.00  0    4    2
#> 13: 26.0   4 120.3  91 4.43 2.140 16.70  1    5    2
#> 14: 27.3   4  79.0  66 4.08 1.935 18.90  1    4    1
#> 15: 30.4   4  75.7  52 4.93 1.615 18.52  1    4    2
#> 16: 30.4   4  95.1 113 3.77 1.513 16.90  1    5    2
#> 17: 32.4   4  78.7  66 4.08 2.200 19.47  1    4    1
#> 18: 33.9   4  71.1  65 4.22 1.835 19.90  1    4    1

Add on

I tried digging a little deeper to see whats going on. First I called debug(as.formula) and then looked at the following objects in each iteration:

object
ls(environment(object))

We can see that in "strategy 2" each formula is associated with a different environment, and when looking at the environment we see it contains one object x, which when inspected (environment(object)$x) contains the expected mtcars data.

In "strategy 1" however, we can observe that each call to as.formula associates the same environment with the formula being created. Further, when inspecting the environment we can see that it is populated with the single vectors of the subsetted mtcars data (e.g. am, carb, cyl etc.) as well as some functions (e.g. .POSIXt, Cfastmean, strptime etc.). This is probably where things go awry. I would suspect that when associating the same environment with two different formulas (models), the first models underlying data gets "updated" when the second model is calculated. This should also be the reason why the model output itself is correct. To the time the first model is being calculated, the data is still correct. It is overwritten by the second model, which therefore is correct, too. But when accessing the underlying data afterwards things get messy.

Side note
I was curious if we can observe similar problems and differences in the tidyverse when using expand.model.frame and the answer is "yes". Here, the new rowwise notation throws an error, while the group_map as well as the map approach work:

# dplyr approaches:

# group_map: works
mod3 <- mtcars %>%
  group_by(vs) %>%
  group_map(~ lm(hp ~ mpg, data = .x))

expand.model.frame(mod3[[1]], "am")

# mutate / rowwise: does not work
mod4 <- mtcars %>%
  nest_by(vs) %>%
  mutate(models = list(lm(hp ~ mpg, data = data)))

expand.model.frame(mod4$models[[1]], "am")

# mutate / map: works
mod5 <- mtcars %>%
  tidyr::nest(data = !vs) %>%
  mutate(models = purrr::map(data, ~ lm(hp ~ mpg, data = .x)))
  
expand.model.frame(mod5$models[[1]], "am")
like image 98
TimTeaFan Avatar answered Oct 11 '22 07:10

TimTeaFan