The answer to this question clearly explains how to retrieve tidy regression results by group when running a regression through a dplyr pipe, but the solution is no longer reproducible.
How can one use dplyr and broom in combination to run a regression by group and retrieve tidy results using R 4.02, dplyr 1.0.0, and broom 0.7.0?
Specifically, the example answer from the question linked above,
library(dplyr)
library(broom)
df.h = data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)
dfHour = df.h %>% group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .))
# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)
returns the following error (and three warnings) when I run it on my system:
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
Calling var(x) on a factor x is defunct.
Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
In addition: Warning messages:
1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom.
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
If I reformat df.h$hour
as a character rather than factor,
df.h <- df.h %>%
mutate(
hour = as.character(hour)
)
re-run the regression by group, and again attempt to retrieve the results using broom::tidy
,
dfHour = df.h %>% group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .))
# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)
I get this error:
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
is.atomic(x) is not TRUE
I assume that the problem has to do with the fact that the group-level regression results are stored as lists in dfHour$fitHour
, but I am unsure how to correct the error and once again tidily and quickly compile the regression results, as used to work in the originally posted code/answer.
****** Updated with more succinct code pulled from the dplyr 1.0.0 release notes ******
Thank you. I was struggling with a similar question with the update to dplyr 1.0.0 related to using the examples in the provided link. This was both a helpful question and answer.
One note as an FYI, do() has been superseded as of dplyr 1.0.0, so may consider using the updated language (now very efficient with my update):
dfHour = df.h %>%
# replace group_by() with nest_by()
# to convert your model data to a vector of lists
nest_by(hour) %>%
# change do() to mutate(), then add list() before your model
# make sure to change data = . to data = data
mutate(fitHour = list(lm(price ~ wind + temp, data = data))) %>%
summarise(tidy(mod))
Done!
This gives a very efficient df with select output stats. The last line replaces the following code (from my original response), which does the same thing, but less easily:
ungroup() %>%
# then leverage the feedback from @akrun
transmute(hour, HourCoef = map(fitHour, tidy)) %>%
unnest(HourCoef)
dfHour
Which gives the outupt:
# A tibble: 72 x 6
hour term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 68.6 21.0 3.27 0.00428
2 1 wind 0.000558 0.0124 0.0450 0.965
3 1 temp -0.866 0.907 -0.954 0.353
4 2 (Intercept) 31.9 17.4 1.83 0.0832
5 2 wind 0.00950 0.0113 0.838 0.413
6 2 temp 1.69 0.802 2.11 0.0490
7 3 (Intercept) 85.5 22.3 3.83 0.00122
8 3 wind -0.0210 0.0165 -1.27 0.220
9 3 temp 0.276 1.14 0.243 0.811
10 4 (Intercept) 73.3 15.1 4.86 0.000126
# ... with 62 more rows
Thanks for the patience, I am working through this myself!
Issue would be that there is a grouping attribute rowwise
after the do
call and the column 'fitHour' is a list
. We can ungroup
, loop over the list
with map
and tidy
it to a list
column
library(dplyr)
library(purrr)
library(broom)
df.h %>%
group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .)) %>%
ungroup %>%
mutate(HourCoef = map(fitHour, tidy))
Or use unnest
after the mtuate
df.h %>%
group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .)) %>%
ungroup %>%
transmute(hour, HourCoef = map(fitHour, tidy)) %>%
unnest(HourCoef)
# A tibble: 72 x 6
# hour term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 89.8 20.2 4.45 0.000308
# 2 1 wind 0.00493 0.0151 0.326 0.748
# 3 1 temp -1.84 1.08 -1.71 0.105
# 4 2 (Intercept) 75.6 23.7 3.20 0.00500
# 5 2 wind -0.00910 0.0146 -0.622 0.542
# 6 2 temp 0.192 0.853 0.225 0.824
# 7 3 (Intercept) 44.0 23.9 1.84 0.0822
# 8 3 wind -0.00158 0.0166 -0.0953 0.925
# 9 3 temp 0.622 1.19 0.520 0.609
#10 4 (Intercept) 57.8 18.9 3.06 0.00676
# … with 62 more rows
If we wanted a single dataset, pull
the 'fitHour', loop over the list
with map
, condense it to a single dataset by row binding (suffix _dfr
)
df.h %>%
group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .)) %>%
ungroup %>%
pull(fitHour) %>%
map_dfr(tidy, .id = 'grp')
NOTE: The OP's error message was able to be replicated with R 4.02
, dplyr 1.0.0
and broom 0.7.0
tidy(dfHour,fitHour)
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : Calling var(x) on a factor x is defunct. Use something like 'all(duplicated(x)[-1L])' to test for a constant vector. In addition: Warning messages: 1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom. 2: In mean.default(X[[i]], ...) :
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With