I'm following the Many Models example from the R for Data Science book to generate individual multiple linear regression models. As illustrated in the reproducible example below, my data frame consists of three columns: id, val1, val2, val3,
. I am able to fit linear models using the map
function as detailed in Hadley's book. However, I've been struggling with extracting the coefficients from each of the models and adding these values as columns in the dataframes in my.list
object. The way the model coefficients are currently stored, makes it difficult/cumbersome to call in other parts of my code.
The best I've come up with so far is to make a list the length of my.list
and extract coefficients by iterating over each data frame in my.list
: Name1, Name2, Name3
. This means that now I have another list in my global environment and the coef.list
no longer contains factors Name1, Name2, Name3
from my.list
; these have now been replaced with [[1]], [[2]], [[3]]
.
Can anyone suggest a "cleaner" way of extracting model coefficients when working with multiple models? My preferred output would simply be to create a column for each coefficient: intercept, val1, val2
. These columns would go in the existing data frames Name1, Name2, Name3
within my.list
so that I can use mutate
directly on the data frames:
# reproducible example
set.seed(1363)
d1 <- data.frame(id=c("Name1", "Name2", "Name3"),
val1=c(rnorm(n=15, mean=5)),
val2=c(rnorm(n=15, mean=3)),
val3=c(rnorm(n=15, mean=8)))
# linear model function
lm.fun <- function(df){
lm(val3 ~ val1+val2, data = df)
}
# map lm function
d1 <- d1 %>%
group_by(id) %>%
nest() %>%
mutate(model = map(data, lm.fun)) %>%
unnest(data, .drop = FALSE)
#split data frame by 'id' and convert into list
my.list <- split(d1, d1$id)
# make list of coefficients
coef.list <- list(length(my.list))
for (i in seq_along(my.list)) {
coef.list[[i]] <- my.list[[i]][["model"]][[1]][["coefficients"]]
}
>head(coef.list, n=1)
[[1]]
(Intercept) val1 val2
9.03278337 -0.07096932 0.02119088
DESIRED OUTPUT
my.list$Name1
id val1 val2 val3 intc coef1 coef2
Name1 1 2 3 9.03 -.070 .021
Name1 3 1 5 9.03 -.070 .021
Name1 2 6 8 9.03 -.070 .021
Based on the description, if we need to create some columns with the coefficients, one option is using do
with group_by
. After grouping by 'id', extract the coefficients as a list
within do
and rename
the columns (if needed)
library(tidyverse)
d1 %>%
group_by(id) %>%
do(data.frame(., as.list(coef(lm(val3 ~ val1 + val2, data = .))))) %>%
rename_at(5:7, ~c("intc", "coef1", "coef2"))
# A tibble: 15 x 7
# Groups: id [3]
# id val1 val2 val3 intc coef1 coef2
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Name1 6.76 2.64 9.85 9.03 -0.0710 0.0212
# 2 Name1 6.78 1.52 6.94 9.03 -0.0710 0.0212
# 3 Name1 4.14 4.31 8.30 9.03 -0.0710 0.0212
# 4 Name1 5.55 2.16 9.97 9.03 -0.0710 0.0212
# 5 Name1 6.18 3.64 8.32 9.03 -0.0710 0.0212
# 6 Name2 6.08 1.12 9.96 7.33 0.468 -0.488
# 7 Name2 3.54 4.71 6.43 7.33 0.468 -0.488
# 8 Name2 5.66 4.54 8.69 7.33 0.468 -0.488
# 9 Name2 6.88 4.15 7.79 7.33 0.468 -0.488
#10 Name2 4.89 1.27 8.72 7.33 0.468 -0.488
#11 Name3 6.41 4.38 6.22 20.1 -2.15 0.118
#12 Name3 5.06 3.28 9.42 20.1 -2.15 0.118
#13 Name3 6.25 3.16 8.15 20.1 -2.15 0.118
#14 Name3 6.03 3.63 7.78 20.1 -2.15 0.118
#15 Name3 6.51 1.46 5.90 20.1 -2.15 0.118
Or we can use broom
package functions. Functions like tidy
, glance
extract columns of interest (and more). After nest
ing the grouped dataset, build the model, extract the components with tidy
, select
the columns of interest and unnest
library(broom)
d1 %>%
group_by(id) %>%
nest %>%
mutate(model1 = map(data, ~
lm(val3 ~ val1 + val2, data = .) %>%
tidy %>%
dplyr::select(term, estimate) %>%
spread(term, estimate) %>%
rename_all(~ c("intc", paste0("coef", 1:2))))) %>%
unnest(model1) %>%
unnest(data)
# A tibble: 15 x 7
# id intc coef1 coef2 val1 val2 val3
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Name1 9.03 -0.0710 0.0212 6.76 2.64 9.85
# 2 Name1 9.03 -0.0710 0.0212 6.78 1.52 6.94
# 3 Name1 9.03 -0.0710 0.0212 4.14 4.31 8.30
# 4 Name1 9.03 -0.0710 0.0212 5.55 2.16 9.97
# 5 Name1 9.03 -0.0710 0.0212 6.18 3.64 8.32
# 6 Name2 7.33 0.468 -0.488 6.08 1.12 9.96
# 7 Name2 7.33 0.468 -0.488 3.54 4.71 6.43
# 8 Name2 7.33 0.468 -0.488 5.66 4.54 8.69
# 9 Name2 7.33 0.468 -0.488 6.88 4.15 7.79
#10 Name2 7.33 0.468 -0.488 4.89 1.27 8.72
#11 Name3 20.1 -2.15 0.118 6.41 4.38 6.22
#12 Name3 20.1 -2.15 0.118 5.06 3.28 9.42
#13 Name3 20.1 -2.15 0.118 6.25 3.16 8.15
#14 Name3 20.1 -2.15 0.118 6.03 3.63 7.78
#15 Name3 20.1 -2.15 0.118 6.51 1.46 5.90
Or without using broom
d1 %>%
group_by(id) %>%
nest %>%
mutate(model = map(data, ~ lm(val3 ~ val1 + val2, data = .) %>%
coef %>%
as.list %>%
as_tibble)) %>%
unnest(model) %>%
unnest(data)
If we only need a summarised output for each 'id'
d1 %>%
group_by(id) %>%
nest %>%
mutate(model1 = map(data, ~
lm(val3 ~ val1 + val2, data = .) %>%
tidy %>%
dplyr::select(term, estimate) %>%
spread(term, estimate) %>%
rename_all(~ c("intc", paste0("coef", 1:2))))) %>%
dplyr::select(-data) %>%
unnest
Or with data.table
, we can do this more concisely, after grouping by 'id' and assigning (:=
) the new columns by extracting the coef
of the model as a list
library(data.table)
setDT(d1)[, c('intc', 'coef1', 'coef2') :=
as.list(coef(lm(val3 ~ val1 + val2))), id]
d1[order(id)]
# id val1 val2 val3 intc coef1 coef2
# 1: Name1 6.755964 2.642874 9.849828 9.032783 -0.07096932 0.02119088
# 2: Name1 6.776666 1.522431 6.937053 9.032783 -0.07096932 0.02119088
# 3: Name1 4.141883 4.307537 8.301940 9.032783 -0.07096932 0.02119088
# 4: Name1 5.551850 2.163882 9.971588 9.032783 -0.07096932 0.02119088
# 5: Name1 6.179506 3.635832 8.319042 9.032783 -0.07096932 0.02119088
# 6: Name2 6.083243 1.116293 9.960934 7.325156 0.46840770 -0.48806159
# 7: Name2 3.536476 4.708967 6.427627 7.325156 0.46840770 -0.48806159
# 8: Name2 5.663909 4.541081 8.691523 7.325156 0.46840770 -0.48806159
# 9: Name2 6.883746 4.150780 7.791050 7.325156 0.46840770 -0.48806159
#10: Name2 4.890291 1.269559 8.723792 7.325156 0.46840770 -0.48806159
#11: Name3 6.414915 4.383609 6.220188 20.106581 -2.14601530 0.11770877
#12: Name3 5.059774 3.276510 9.421862 20.106581 -2.14601530 0.11770877
#13: Name3 6.251416 3.157157 8.147720 20.106581 -2.14601530 0.11770877
#14: Name3 6.028100 3.630858 7.783118 20.106581 -2.14601530 0.11770877
#15: Name3 6.505153 1.460564 5.895564 20.106581 -2.14601530 0.11770877
Or do a join
if we don't want to update the initial data
setDT(d1)[, as.list(coef(lm(val3 ~ val1 + val2))), id][d1, on = .(id)]
NOTE: 'd1' is the initial dataset
Using sapply
, given the data d1
(containing the model
column):
coeff <- sapply(d1$model, function(x) return(coefficients(x)))
library(dplyr)
bind_cols(d1, data.frame(t(coeff))) %>%
rename_at(6:8, ~ c("intc", "coef1", "coef2")) %>%
distinct(id, .keep_all = TRUE)
library(tidyverse)
d1%>%
group_by(id)%>%
mutate(s=toString(coef(lm(val3~val1+val2))))%>%
separate(s,c("intercept","coef1","coef2"),sep=",",convert = T)%>%
arrange(id)
# A tibble: 15 x 7
# Groups: id [3]
id val1 val2 val3 intercept coef1 coef2
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Name1 6.76 2.64 9.85 9.03 -0.0710 0.0212
2 Name1 6.78 1.52 6.94 9.03 -0.0710 0.0212
3 Name1 4.14 4.31 8.30 9.03 -0.0710 0.0212
4 Name1 5.55 2.16 9.97 9.03 -0.0710 0.0212
5 Name1 6.18 3.64 8.32 9.03 -0.0710 0.0212
6 Name2 6.08 1.12 9.96 7.33 0.468 -0.488
7 Name2 3.54 4.71 6.43 7.33 0.468 -0.488
8 Name2 5.66 4.54 8.69 7.33 0.468 -0.488
9 Name2 6.88 4.15 7.79 7.33 0.468 -0.488
10 Name2 4.89 1.27 8.72 7.33 0.468 -0.488
11 Name3 6.41 4.38 6.22 20.1 -2.15 0.118
12 Name3 5.06 3.28 9.42 20.1 -2.15 0.118
13 Name3 6.25 3.16 8.15 20.1 -2.15 0.118
14 Name3 6.03 3.63 7.78 20.1 -2.15 0.118
15 Name3 6.51 1.46 5.90 20.1 -2.15 0.118
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With