
Regression by group while retaining all the columns in R

I am running a linear regression by group and want to extract the residuals of each regression:

library(dplyr)
library(broom)  # provides augment()
set.seed(124)

dat <- data.frame(ID = sample(111:503, 18576, replace = TRUE), 
                  ID2 = sample(11:50, 18576, replace = TRUE), 
                  ID3 = sample(1:14, 18576, replace = TRUE),
                  yearRef = sample(1998:2014, 18576, replace = TRUE),
                  value = rnorm(18576))

# Fit one regression per ID3 group and collect the augmented output
resid <- dat %>% dplyr::group_by(ID3) %>% 
         do(augment(lm(value ~ yearRef, data=.))) %>% ungroup()

How do I retain ID and ID2 in resid as well? At the moment, only ID3 is kept in the final data frame, alongside the model columns.

89_Simple asked Nov 22 '19 14:11



2 Answers

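Use group_split() to split the data by ID3, then loop over the groups with map_dfr(), using bind_cols() to attach ID and ID2 to the augment() output for each group. Note that .id = "ID3" records each group's position in the split list as a character string; that matches the real ID3 values here only because they run from 1 to 14.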

library(dplyr)
library(purrr)
library(broom)  # provides augment()
dat %>% group_split(ID3) %>% 
   map_dfr(~bind_cols(select(.x,ID,ID2), augment(lm(value~yearRef, data=.x))), .id = "ID3")

# A tibble: 18,576 x 12
   ID3      ID   ID2   value yearRef .fitted .se.fit   .resid    .hat .sigma .cooksd
   <chr> <int> <int>   <dbl>   <int>   <dbl>   <dbl>    <dbl>   <dbl>  <dbl>   <dbl>
 1 1       196    16 -0.385     2009 -0.0406  0.0308 -0.344   1.00e-3  0.973 6.27e-5
 2 1       372    47 -0.793     2012 -0.0676  0.0414 -0.726   1.81e-3  0.973 5.05e-4
 3 1       470    15 -0.496     2011 -0.0586  0.0374 -0.438   1.48e-3  0.973 1.50e-4
 4 1       242    40 -1.13      2010 -0.0496  0.0338 -1.08    1.21e-3  0.973 7.54e-4
 5 1       471    34  1.28      2006 -0.0135  0.0262  1.29    7.26e-4  0.972 6.39e-4
 6 1       434    35 -1.09      1998  0.0586  0.0496 -1.15    2.61e-3  0.973 1.82e-3
 7 1       467    45 -0.0663    2011 -0.0586  0.0374 -0.00769 1.48e-3  0.973 4.64e-8
 8 1       334    27 -1.37      2003  0.0135  0.0305 -1.38    9.86e-4  0.972 9.92e-4
 9 1       186    25 -0.0195    2003  0.0135  0.0305 -0.0331  9.86e-4  0.973 5.71e-7
10 1       114    34  1.09      2014 -0.0857  0.0500  1.18    2.64e-3  0.973 1.94e-3
# ... with 18,566 more rows, and 1 more variable: .std.resid <dbl>
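
If the group keys are not simply 1, 2, 3, ..., the list position stored by .id will no longer equal the real ID3 value. A minimal sketch of one way around that is to carry ID3 through select() instead of using .id. The data frame `small` and its group values 3, 7 and 9 are made up for illustration and are not part of the question's data.

```r
library(dplyr)
library(purrr)
library(broom)

set.seed(124)

# Hypothetical small data set whose group keys are NOT 1..n
small <- data.frame(ID      = sample(111:503, 200, replace = TRUE),
                    ID2     = sample(11:50, 200, replace = TRUE),
                    ID3     = sample(c(3L, 7L, 9L), 200, replace = TRUE),
                    yearRef = sample(1998:2014, 200, replace = TRUE),
                    value   = rnorm(200))

# Split by ID3, fit one model per piece, and bind the identifier
# columns (including ID3 itself) next to the augment() output
res <- small %>%
  group_split(ID3) %>%
  map_dfr(~ bind_cols(select(.x, ID, ID2, ID3),
                      augment(lm(value ~ yearRef, data = .x))))
```

Here ID3 keeps its original integer type and values, because it comes from the data rather than from the position of each group in the list.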
A. Suliman answered Oct 21 '22 11:10



Taking the "many models" approach, you can nest the data on ID3 and use purrr::map to create a list-column of the broom::augment data frames. The data list-column has all the original columns aside from ID3; map into that and select just the ones you want. Here I'm assuming you want to keep any column that starts with "ID", but you can change this. Then unnest both the data and the augment data frames.

library(dplyr)
library(tidyr)

dat %>%
  group_by(ID3) %>%
  nest() %>%
  mutate(aug = purrr::map(data, ~broom::augment(lm(value ~ yearRef, data = .))),
         data = purrr::map(data, select, starts_with("ID"))) %>%
  unnest(c(data, aug))
#> # A tibble: 18,576 x 12
#> # Groups:   ID3 [14]
#>      ID3    ID   ID2   value yearRef .fitted .se.fit  .resid    .hat .sigma
#>    <int> <int> <int>   <dbl>   <int>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
#>  1    11   431    15  0.619     2002  0.0326  0.0346  0.586  1.21e-3  0.995
#>  2    11   500    21 -0.432     2000  0.0299  0.0424 -0.462  1.82e-3  0.995
#>  3    11   392    28 -0.246     1998  0.0273  0.0515 -0.273  2.67e-3  0.995
#>  4    11   292    40 -0.425     1998  0.0273  0.0515 -0.452  2.67e-3  0.995
#>  5    11   175    36 -0.258     1999  0.0286  0.0468 -0.287  2.22e-3  0.995
#>  6    11   419    23  3.13      2005  0.0365  0.0273  3.09   7.54e-4  0.992
#>  7    11   329    17 -0.0414    2007  0.0391  0.0274 -0.0806 7.57e-4  0.995
#>  8    11   284    23 -0.450     2006  0.0378  0.0268 -0.488  7.25e-4  0.995
#>  9    11   136    28 -0.129     2006  0.0378  0.0268 -0.167  7.25e-4  0.995
#> 10    11   118    17 -1.55      2013  0.0470  0.0470 -1.60   2.24e-3  0.995
#> # … with 18,566 more rows, and 2 more variables: .cooksd <dbl>,
#> #   .std.resid <dbl>
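
A possible variation on the nested approach, sketched under the assumption that you want every original column rather than a selection: broom::augment() accepts a data argument, and passing each group's data frame to it should return the original columns together with the model columns, so the separate select() step is not needed. The data frame `small` below is a made-up, smaller stand-in for dat.

```r
library(dplyr)
library(tidyr)

set.seed(124)

# Hypothetical smaller stand-in for the question's data
small <- data.frame(ID      = sample(111:503, 200, replace = TRUE),
                    ID2     = sample(11:50, 200, replace = TRUE),
                    ID3     = sample(1:4, 200, replace = TRUE),
                    yearRef = sample(1998:2014, 200, replace = TRUE),
                    value   = rnorm(200))

res2 <- small %>%
  group_by(ID3) %>%
  nest() %>%
  # data = .x asks augment() to return all columns of the group's data,
  # not only the variables used in the model formula
  mutate(aug = purrr::map(data, ~ broom::augment(lm(value ~ yearRef, data = .x),
                                                 data = .x))) %>%
  select(ID3, aug) %>%   # drop the nested copy of the raw data
  unnest(aug) %>%
  ungroup()
```

The result has one row per original observation, with ID, ID2 and ID3 next to .fitted, .resid and the other augment() columns.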
camille answered Oct 21 '22 11:10
