I suspect that this question might be a duplicate, however, I found nothing satisfactory. Imagine a simple dataset with a structure like this:
set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
cov_b = rbinom(100, 1, prob = 0.5),
cont_a = runif(100),
cont_b = runif(100),
dep = runif(100))
cov_a cov_b cont_a cont_b dep
1 0 1 0.238726027 0.784575267 0.9860542973
2 1 0 0.962358936 0.009429905 0.1370674714
3 0 0 0.601365726 0.779065883 0.9053095817
4 1 1 0.515029727 0.729390652 0.5763018376
5 1 0 0.402573342 0.630131853 0.3954488591
6 0 1 0.880246541 0.480910830 0.4498024841
7 1 1 0.364091865 0.156636851 0.7065019011
8 1 1 0.288239281 0.008215520 0.0825027458
9 1 0 0.170645235 0.452458394 0.3393125802
10 0 0 0.172171746 0.492293329 0.6807875512
What I'm looking for is an elegant dplyr
/tidyverse
option to fit a separate regression model for every cov_
variable, while including the same set of additional variables and the same dependent variable.
I'm able to solve this problem using this code (require purrr
, dplyr
, tidyr
and broom
):
map(.x = names(df)[grepl("cov_", names(df))],
~ df %>%
nest() %>%
mutate(res = map(data, function(y) tidy(lm(dep ~ cont_a + cont_b + !!sym(.x), data = y)))) %>%
unnest(res))
[[1]]
# A tibble: 4 x 6
data term estimate std.error statistic p.value
<list> <chr> <dbl> <dbl> <dbl> <dbl>
1 <tibble [100 × 5]> (Intercept) 0.472 0.0812 5.81 0.0000000799
2 <tibble [100 × 5]> cont_a -0.103 0.0983 -1.05 0.296
3 <tibble [100 × 5]> cont_b 0.172 0.0990 1.74 0.0848
4 <tibble [100 × 5]> cov_a -0.0455 0.0581 -0.783 0.436
[[2]]
# A tibble: 4 x 6
data term estimate std.error statistic p.value
<list> <chr> <dbl> <dbl> <dbl> <dbl>
1 <tibble [100 × 5]> (Intercept) 0.415 0.0787 5.27 0.000000846
2 <tibble [100 × 5]> cont_a -0.0874 0.0984 -0.888 0.377
3 <tibble [100 × 5]> cont_b 0.181 0.0980 1.84 0.0682
4 <tibble [100 × 5]> cov_b 0.0482 0.0576 0.837 0.405
However, I would like to avoid the use of double-map()
and solve it by using a somehow more direct or elegant approach.
I'm not sure if this will be considered more direct/elegant but here is my solution that does not use a double map
:
library(tidyverse)
library(broom)
gen_model_expr <- function(var) {
form = paste("dep ~ cont_a + cont_b +", var)
tidy(lm(as.formula(form), data = df))
}
grep("cov_", names(df), value = TRUE) %>%
map(gen_model_expr)
Output (Note that it does not retain the data column):
[[1]]
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.472 0.0812 5.81 0.0000000799
2 cont_a -0.103 0.0983 -1.05 0.296
3 cont_b 0.172 0.0990 1.74 0.0848
4 cov_a -0.0455 0.0581 -0.783 0.436
[[2]]
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.415 0.0787 5.27 0.000000846
2 cont_a -0.0874 0.0984 -0.888 0.377
3 cont_b 0.181 0.0980 1.84 0.0682
4 cov_b 0.0482 0.0576 0.837 0.405
EDIT
In the interest of speed performance (inspired by @TimTeaFan), a benchmark test comparing the different ways to retrieve the covariate names is shown below. grep("cov_", names(df), value = TRUE)
is the fastest
# A tibble: 4 x 13
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:> <bch:> <dbl> <bch:byt>
1 names(df)[grepl("cov_", names(df))] 7.59µs 8.4µs 101975. 0B
2 grep("cov_", colnames(df), value = TRUE) 8.21µs 8.96µs 103142. 0B
3 grep("cov_", names(df), value = TRUE) 6.96µs 7.43µs 128694. 0B
4 df %>% select(starts_with("cov_")) %>% colnames 1.17ms 1.33ms 636. 5.39KB
Not directly tidyverse
-based but: multiple estimations can be performed at once with the package fixest.
The syntax is explicit and the solution to the question asked can be written in one line of code:
library(fixest)
set.seed(123)
n = 100
df = data.frame(cov_a = rbinom(n, 1, prob = 0.5),
cov_b = rbinom(n, 1, prob = 0.5),
cont_a = runif(n), cont_b = runif(n),
dep = runif(n))
# Estimation: sw means stepwise
res = feols(dep ~ sw(cov_a, cov_b) + cont_a + cont_b, df)
That's it: the two estimations are done. (Note that feols
is like lm
.)
Then you can display the results:
# Display the results
etable(res, order = "Int|cov")
#> model 1 model 2
#> Dependent Var.: dep dep
#>
#> (Intercept) 0.4722*** (0.0812) 0.4148*** (0.0788)
#> cov_a -0.0455 (0.0581)
#> cov_b 0.0482 (0.0576)
#> cont_a -0.1033 (0.0983) -0.0874 (0.0984)
#> cont_b 0.1723. (0.0990) 0.1808. (0.0980)
#> _______________ __________________ __________________
#> S.E. type Standard Standard
#> Observations 100 100
#> R2 0.04823 0.04910
#> Adj. R2 0.01849 0.01938
And easily get them to the tidyverse:
# Get the results into tidyverse
library(broom)
lapply(as.list(res), function(x) tidy(x))
#> [[1]]
#> # A tibble: 4 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.472 0.0812 5.81 0.0000000799
#> 2 cov_a -0.0455 0.0581 -0.783 0.436
#> 3 cont_a -0.103 0.0983 -1.05 0.296
#> 4 cont_b 0.172 0.0990 1.74 0.0848
#>
#> [[2]]
#> # A tibble: 4 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.415 0.0787 5.27 0.000000846
#> 2 cov_b 0.0482 0.0576 0.837 0.405
#> 3 cont_a -0.0874 0.0984 -0.888 0.377
#> 4 cont_b 0.181 0.0980 1.84 0.0682
Here's the documentation on multiple estimations. You may also, still at once, apply multiple estimations to dependent variables/fixed-effects (if appropriate)/split sample--all compactly.
Last thing: multiple estimations are highly optimized, so you'll get high performance gains vis à vis loop based solutions (like map).
I throw in two approaches:
The first one is rather boring. We can use {dplyr}'s rowwise
notation instead of purrr::map
. This approach comes in two flavors. After rowwise
we can use either (A) mutate %>% unnest
or, (B), we can use group_map
. In both approaches I refrained from duplicating the data, but we can easily include the data in each row if needed (when setting up the the tibble
we can do tibble(myvar = ., data = list(df))
). While (A) gives us one tibble that includes all the data, the group_map
in (B) returns a list similar to the "double map" approach from the original example.
The second approach I'd consider as rather "fresh" (although less straight-forward), since it neither uses rowwise
nor map
. Here we use {dplyr}'s across
function in combination with cur_column()
, create a new column for each output and then pivot_longer
and unnest
to have all the results in one tibble
.
The final benchmark shows: "doulbe_map" is slowest (because of the duplicate data column), followed by "across" and "row_unnest", while "row_group_map" is rather fast. Nevertheless, the fastest approach is the one by @latlio using map
and a custom function (bellow called "map_custom_fun"), but although it is using purrr
, it might be less "dplyr-ish".
library(tidyverse)
library(broom)
set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
cov_b = rbinom(100, 1, prob = 0.5),
cont_a = runif(100),
cont_b = runif(100),
dep = runif(100))
# first approach: using dplyr rowwise
# Variation A:
# rowwise %>% mutate %>% unnest
df %>%
select(starts_with("cov_")) %>%
colnames %>%
tibble(myvar = .) %>%
rowwise() %>%
mutate(res = list(tidy(lm(dep ~ cont_a + cont_b + eval(sym(.data$myvar)), data = df)))) %>%
unnest(res)
#> # A tibble: 8 x 6
#> myvar term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 cov_a (Intercept) 0.472 0.0812 5.81 0.0000000799
#> 2 cov_a cont_a -0.103 0.0983 -1.05 0.296
#> 3 cov_a cont_b 0.172 0.0990 1.74 0.0848
#> 4 cov_a eval(sym(.data$myvar)) -0.0455 0.0581 -0.783 0.436
#> 5 cov_b (Intercept) 0.415 0.0787 5.27 0.000000846
#> 6 cov_b cont_a -0.0874 0.0984 -0.888 0.377
#> 7 cov_b cont_b 0.181 0.0980 1.84 0.0682
#> 8 cov_b eval(sym(.data$myvar)) 0.0482 0.0576 0.837 0.405
# Variation B:
# rowwise %>% group_map
df %>%
select(starts_with("cov_")) %>%
colnames %>%
tibble(myvar = .) %>%
rowwise() %>%
group_map(.keep = TRUE,
~ tidy(lm(dep ~ cont_a + cont_b + eval(sym(myvar)), data = df)))
#> [[1]]
#> # A tibble: 4 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.472 0.0812 5.81 0.0000000799
#> 2 cont_a -0.103 0.0983 -1.05 0.296
#> 3 cont_b 0.172 0.0990 1.74 0.0848
#> 4 eval(sym(.x$myvar)) -0.0455 0.0581 -0.783 0.436
#>
#> [[2]]
#> # A tibble: 4 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.415 0.0787 5.27 0.000000846
#> 2 cont_a -0.0874 0.0984 -0.888 0.377
#> 3 cont_b 0.181 0.0980 1.84 0.0682
#> 4 eval(sym(.x$myvar)) 0.0482 0.0576 0.837 0.405
# second approach: using summarise(across)
# we need a `tibble` here, otherwise printing gets messed up
df_tbl <- as_tibble(df)
df_tbl %>%
summarise(across(starts_with("cov_"),
~ list(tidy(lm(
reformulate(c("cont_a","cont_b", cur_column()), "dep"),
data = df_tbl))))) %>%
pivot_longer(cols = everything(),
names_to = "var",
values_to = "res") %>%
unnest(res)
#> # A tibble: 8 x 6
#> var term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 cov_a (Intercept) 0.472 0.0812 5.81 0.0000000799
#> 2 cov_a cont_a -0.103 0.0983 -1.05 0.296
#> 3 cov_a cont_b 0.172 0.0990 1.74 0.0848
#> 4 cov_a cov_a -0.0455 0.0581 -0.783 0.436
#> 5 cov_b (Intercept) 0.415 0.0787 5.27 0.000000846
#> 6 cov_b cont_a -0.0874 0.0984 -0.888 0.377
#> 7 cov_b cont_b 0.181 0.0980 1.84 0.0682
#> 8 cov_b cov_b 0.0482 0.0576 0.837 0.405
Created on 2021-01-10 by the reprex package (v0.3.0)
Benchmark
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> double_map 20.92847 22.238626 23.645280 22.726705 25.002493 34.29351 100
#> row_unnest 15.30179 15.835506 16.714873 16.358134 17.314802 20.60496 100
#> row_groupmap 10.10168 10.490374 11.237398 10.709524 11.452677 20.40186 100
#> across 16.47369 17.117178 18.593908 17.945136 19.431190 29.29384 100
#> map_custom_fun 6.85758 7.311608 7.953066 7.611394 8.305757 19.57006 100
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With