Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fitting several regression models by changing only one independent variable within mutate()

I suspect that this question might be a duplicate, however, I found nothing satisfactory. Imagine a simple dataset with a structure like this:

set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
                 cov_b = rbinom(100, 1, prob = 0.5),
                 cont_a  = runif(100),
                 cont_b = runif(100),
                 dep = runif(100))

    cov_a cov_b      cont_a      cont_b          dep
1       0     1 0.238726027 0.784575267 0.9860542973
2       1     0 0.962358936 0.009429905 0.1370674714
3       0     0 0.601365726 0.779065883 0.9053095817
4       1     1 0.515029727 0.729390652 0.5763018376
5       1     0 0.402573342 0.630131853 0.3954488591
6       0     1 0.880246541 0.480910830 0.4498024841
7       1     1 0.364091865 0.156636851 0.7065019011
8       1     1 0.288239281 0.008215520 0.0825027458
9       1     0 0.170645235 0.452458394 0.3393125802
10      0     0 0.172171746 0.492293329 0.6807875512

What I'm looking for is an elegant dplyr/tidyverse option to fit a separate regression model for every cov_ variable, while including the same set of additional variables and the same dependent variable.

I'm able to solve this problem using this code (require purrr, dplyr, tidyr and broom):

map(.x = names(df)[grepl("cov_", names(df))],
    ~ df %>%
     nest() %>%
     mutate(res = map(data, function(y) tidy(lm(dep ~ cont_a + cont_b + !!sym(.x), data = y)))) %>%
     unnest(res))

[[1]]
# A tibble: 4 x 6
  data               term        estimate std.error statistic      p.value
  <list>             <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 <tibble [100 × 5]> (Intercept)   0.472     0.0812     5.81  0.0000000799
2 <tibble [100 × 5]> cont_a       -0.103     0.0983    -1.05  0.296       
3 <tibble [100 × 5]> cont_b        0.172     0.0990     1.74  0.0848      
4 <tibble [100 × 5]> cov_a        -0.0455    0.0581    -0.783 0.436       

[[2]]
# A tibble: 4 x 6
  data               term        estimate std.error statistic     p.value
  <list>             <chr>          <dbl>     <dbl>     <dbl>       <dbl>
1 <tibble [100 × 5]> (Intercept)   0.415     0.0787     5.27  0.000000846
2 <tibble [100 × 5]> cont_a       -0.0874    0.0984    -0.888 0.377      
3 <tibble [100 × 5]> cont_b        0.181     0.0980     1.84  0.0682     
4 <tibble [100 × 5]> cov_b         0.0482    0.0576     0.837 0.405 

However, I would like to avoid the use of double-map() and solve it by using a somehow more direct or elegant approach.

like image 488
tmfmnk Avatar asked Jan 06 '21 21:01

tmfmnk


3 Answers

I'm not sure if this will be considered more direct/elegant but here is my solution that does not use a double map:

library(tidyverse)
library(broom)

gen_model_expr <- function(var) {
  form = paste("dep ~ cont_a + cont_b +", var)
  tidy(lm(as.formula(form), data = df))
}

grep("cov_", names(df), value = TRUE) %>%
  map(gen_model_expr)

Output (Note that it does not retain the data column):

[[1]]
# A tibble: 4 x 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)   0.472     0.0812     5.81  0.0000000799
2 cont_a       -0.103     0.0983    -1.05  0.296       
3 cont_b        0.172     0.0990     1.74  0.0848      
4 cov_a        -0.0455    0.0581    -0.783 0.436       

[[2]]
# A tibble: 4 x 5
  term        estimate std.error statistic     p.value
  <chr>          <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)   0.415     0.0787     5.27  0.000000846
2 cont_a       -0.0874    0.0984    -0.888 0.377      
3 cont_b        0.181     0.0980     1.84  0.0682     
4 cov_b         0.0482    0.0576     0.837 0.405 

EDIT

In the interest of speed performance (inspired by @TimTeaFan), a benchmark test comparing the different ways to retrieve the covariate names is shown below. grep("cov_", names(df), value = TRUE) is the fastest

# A tibble: 4 x 13
  expression                                         min median `itr/sec` mem_alloc
  <bch:expr>                                      <bch:> <bch:>     <dbl> <bch:byt>
1 names(df)[grepl("cov_", names(df))]             7.59µs  8.4µs   101975.        0B
2 grep("cov_", colnames(df), value = TRUE)        8.21µs 8.96µs   103142.        0B
3 grep("cov_", names(df), value = TRUE)           6.96µs 7.43µs   128694.        0B
4 df %>% select(starts_with("cov_")) %>% colnames 1.17ms 1.33ms      636.    5.39KB
like image 59
latlio Avatar answered Oct 21 '22 01:10

latlio


Not directly tidyverse-based but: multiple estimations can be performed at once with the package fixest.

The syntax is explicit and the solution to the question asked can be written in one line of code:

library(fixest)

set.seed(123)
n = 100
df = data.frame(cov_a = rbinom(n, 1, prob = 0.5),
                cov_b = rbinom(n, 1, prob = 0.5),
                cont_a  = runif(n), cont_b = runif(n),
                dep = runif(n))

# Estimation: sw means stepwise
res = feols(dep ~ sw(cov_a, cov_b) + cont_a + cont_b, df)

That's it: the two estimations are done. (Note that feols is like lm.)

Then you can display the results:

# Display the results
etable(res, order = "Int|cov")
#>                            model 1            model 2
#> Dependent Var.:                dep                dep
#>                                                      
#> (Intercept)     0.4722*** (0.0812) 0.4148*** (0.0788)
#> cov_a             -0.0455 (0.0581)                   
#> cov_b                                 0.0482 (0.0576)
#> cont_a            -0.1033 (0.0983)   -0.0874 (0.0984)
#> cont_b            0.1723. (0.0990)   0.1808. (0.0980)
#> _______________ __________________ __________________
#> S.E. type                 Standard           Standard
#> Observations                   100                100
#> R2                         0.04823            0.04910
#> Adj. R2                    0.01849            0.01938

And easily get them to the tidyverse:

# Get the results into tidyverse
library(broom)
lapply(as.list(res), function(x) tidy(x))
#> [[1]]
#> # A tibble: 4 x 5
#>   term        estimate std.error statistic      p.value
#>   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
#> 1 (Intercept)   0.472     0.0812     5.81  0.0000000799
#> 2 cov_a        -0.0455    0.0581    -0.783 0.436       
#> 3 cont_a       -0.103     0.0983    -1.05  0.296       
#> 4 cont_b        0.172     0.0990     1.74  0.0848      
#> 
#> [[2]]
#> # A tibble: 4 x 5
#>   term        estimate std.error statistic     p.value
#>   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
#> 1 (Intercept)   0.415     0.0787     5.27  0.000000846
#> 2 cov_b         0.0482    0.0576     0.837 0.405      
#> 3 cont_a       -0.0874    0.0984    -0.888 0.377      
#> 4 cont_b        0.181     0.0980     1.84  0.0682

Here's the documentation on multiple estimations. You may also, still at once, apply multiple estimations to dependent variables/fixed-effects (if appropriate)/split sample--all compactly.

Last thing: multiple estimations are highly optimized, so you'll get high performance gains vis à vis loop based solutions (like map).

like image 30
Laurent Bergé Avatar answered Oct 21 '22 00:10

Laurent Bergé


I throw in two approaches:

The first one is rather boring. We can use {dplyr}'s rowwise notation instead of purrr::map. This approach comes in two flavors. After rowwise we can use either (A) mutate %>% unnest or, (B), we can use group_map. In both approaches I refrained from duplicating the data, but we can easily include the data in each row if needed (when setting up the the tibble we can do tibble(myvar = ., data = list(df))). While (A) gives us one tibble that includes all the data, the group_map in (B) returns a list similar to the "double map" approach from the original example.

The second approach I'd consider as rather "fresh" (although less straight-forward), since it neither uses rowwise nor map. Here we use {dplyr}'s across function in combination with cur_column(), create a new column for each output and then pivot_longer and unnest to have all the results in one tibble.

The final benchmark shows: "doulbe_map" is slowest (because of the duplicate data column), followed by "across" and "row_unnest", while "row_group_map" is rather fast. Nevertheless, the fastest approach is the one by @latlio using map and a custom function (bellow called "map_custom_fun"), but although it is using purrr, it might be less "dplyr-ish".

library(tidyverse)
library(broom)

set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
                 cov_b = rbinom(100, 1, prob = 0.5),
                 cont_a  = runif(100),
                 cont_b = runif(100),
                 dep = runif(100))

# first approach: using dplyr rowwise

# Variation A:
# rowwise %>% mutate %>% unnest
df %>% 
  select(starts_with("cov_")) %>% 
  colnames %>%
  tibble(myvar = .) %>%
  rowwise() %>% 
  mutate(res = list(tidy(lm(dep ~ cont_a + cont_b + eval(sym(.data$myvar)), data = df)))) %>% 
  unnest(res)
#> # A tibble: 8 x 6
#>   myvar term                   estimate std.error statistic      p.value
#>   <chr> <chr>                     <dbl>     <dbl>     <dbl>        <dbl>
#> 1 cov_a (Intercept)              0.472     0.0812     5.81  0.0000000799
#> 2 cov_a cont_a                  -0.103     0.0983    -1.05  0.296       
#> 3 cov_a cont_b                   0.172     0.0990     1.74  0.0848      
#> 4 cov_a eval(sym(.data$myvar))  -0.0455    0.0581    -0.783 0.436       
#> 5 cov_b (Intercept)              0.415     0.0787     5.27  0.000000846 
#> 6 cov_b cont_a                  -0.0874    0.0984    -0.888 0.377       
#> 7 cov_b cont_b                   0.181     0.0980     1.84  0.0682      
#> 8 cov_b eval(sym(.data$myvar))   0.0482    0.0576     0.837 0.405

# Variation B:
# rowwise %>% group_map
df %>% 
  select(starts_with("cov_")) %>% 
  colnames %>%
  tibble(myvar = .) %>%
  rowwise() %>% 
  group_map(.keep = TRUE,
            ~ tidy(lm(dep ~ cont_a + cont_b + eval(sym(myvar)), data = df)))
#> [[1]]
#> # A tibble: 4 x 5
#>   term                estimate std.error statistic      p.value
#>   <chr>                  <dbl>     <dbl>     <dbl>        <dbl>
#> 1 (Intercept)           0.472     0.0812     5.81  0.0000000799
#> 2 cont_a               -0.103     0.0983    -1.05  0.296       
#> 3 cont_b                0.172     0.0990     1.74  0.0848      
#> 4 eval(sym(.x$myvar))  -0.0455    0.0581    -0.783 0.436       
#> 
#> [[2]]
#> # A tibble: 4 x 5
#>   term                estimate std.error statistic     p.value
#>   <chr>                  <dbl>     <dbl>     <dbl>       <dbl>
#> 1 (Intercept)           0.415     0.0787     5.27  0.000000846
#> 2 cont_a               -0.0874    0.0984    -0.888 0.377      
#> 3 cont_b                0.181     0.0980     1.84  0.0682     
#> 4 eval(sym(.x$myvar))   0.0482    0.0576     0.837 0.405


# second approach: using summarise(across)
# we need a `tibble` here, otherwise printing gets messed up
df_tbl <- as_tibble(df)

df_tbl %>% 
  summarise(across(starts_with("cov_"), 
                   ~ list(tidy(lm(
                     reformulate(c("cont_a","cont_b", cur_column()), "dep"),
                     data = df_tbl))))) %>% 
  pivot_longer(cols = everything(),
               names_to = "var",
               values_to = "res") %>% 
  unnest(res)
#> # A tibble: 8 x 6
#>   var   term        estimate std.error statistic      p.value
#>   <chr> <chr>          <dbl>     <dbl>     <dbl>        <dbl>
#> 1 cov_a (Intercept)   0.472     0.0812     5.81  0.0000000799
#> 2 cov_a cont_a       -0.103     0.0983    -1.05  0.296       
#> 3 cov_a cont_b        0.172     0.0990     1.74  0.0848      
#> 4 cov_a cov_a        -0.0455    0.0581    -0.783 0.436       
#> 5 cov_b (Intercept)   0.415     0.0787     5.27  0.000000846 
#> 6 cov_b cont_a       -0.0874    0.0984    -0.888 0.377       
#> 7 cov_b cont_b        0.181     0.0980     1.84  0.0682      
#> 8 cov_b cov_b         0.0482    0.0576     0.837 0.405

Created on 2021-01-10 by the reprex package (v0.3.0)

Benchmark

#> Unit: milliseconds
#>            expr      min        lq      mean    median        uq      max neval
#>      double_map 20.92847 22.238626 23.645280 22.726705 25.002493 34.29351   100
#>      row_unnest 15.30179 15.835506 16.714873 16.358134 17.314802 20.60496   100
#>    row_groupmap 10.10168 10.490374 11.237398 10.709524 11.452677 20.40186   100
#>          across 16.47369 17.117178 18.593908 17.945136 19.431190 29.29384   100
#>  map_custom_fun  6.85758  7.311608  7.953066  7.611394  8.305757 19.57006   100
like image 28
TimTeaFan Avatar answered Oct 20 '22 23:10

TimTeaFan