Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

retreiving tidy results from regression by group with broom

The answer to this question clearly explains how to retrieve tidy regression results by group when running a regression through a dplyr pipe, but the solution is no longer reproducible.

How can one use dplyr and broom in combination to run a regression by group and retrieve tidy results using R 4.02, dplyr 1.0.0, and broom 0.7.0?

Specifically, the example answer from the question linked above,

library(dplyr)
library(broom)

df.h = data.frame( 
  hour     = factor(rep(1:24, each = 21)),
  price    = runif(504, min = -10, max = 125),
  wind     = runif(504, min = 0, max = 2500),
  temp     = runif(504, min = - 10, max = 25)  
)

dfHour = df.h %>% group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .))

# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)

returns the following error (and three warnings) when I run it on my system:

Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : 
  Calling var(x) on a factor x is defunct.
  Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
In addition: Warning messages:
1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom. 
2: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA

If I reformat df.h$hour as a character rather than factor,

df.h <- df.h %>%
  mutate(
    hour = as.character(hour)
  )

re-run the regression by group, and again attempt to retrieve the results using broom::tidy,

dfHour = df.h %>% group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .))

# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)

I get this error:

Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : 
  is.atomic(x) is not TRUE

I assume that the problem has to do with the fact that the group-level regression results are stored as lists in dfHour$fitHour, but I am unsure how to correct the error and once again tidily and quickly compile the regression results, as used to work in the originally posted code/answer.

like image 808
C. Rea Avatar asked Feb 04 '23 14:02

C. Rea


2 Answers

****** Updated with more succinct code pulled from the dplyr 1.0.0 release notes ******

Thank you. I was struggling with a similar question with the update to dplyr 1.0.0 related to using the examples in the provided link. This was both a helpful question and answer.

One note as an FYI, do() has been superseded as of dplyr 1.0.0, so may consider using the updated language (now very efficient with my update):

dfHour = df.h %>% 
  # replace group_by() with nest_by() 
  # to convert your model data to a vector of lists
  nest_by(hour) %>%
  # change do() to mutate(), then add list() before your model
  # make sure to change data = .  to data = data
  mutate(fitHour = list(lm(price ~ wind + temp, data = data))) %>%
  summarise(tidy(mod))

Done!

This gives a very efficient df with select output stats. The last line replaces the following code (from my original response), which does the same thing, but less easily:

ungroup() %>%
  # then leverage the feedback from @akrun
  transmute(hour, HourCoef = map(fitHour, tidy)) %>%
  unnest(HourCoef)

dfHour

Which gives the outupt:

# A tibble: 72 x 6
   hour  term         estimate std.error statistic  p.value
   <fct> <chr>           <dbl>     <dbl>     <dbl>    <dbl>
 1 1     (Intercept) 68.6        21.0       3.27   0.00428 
 2 1     wind         0.000558    0.0124    0.0450 0.965   
 3 1     temp        -0.866       0.907    -0.954  0.353   
 4 2     (Intercept) 31.9        17.4       1.83   0.0832  
 5 2     wind         0.00950     0.0113    0.838  0.413   
 6 2     temp         1.69        0.802     2.11   0.0490  
 7 3     (Intercept) 85.5        22.3       3.83   0.00122 
 8 3     wind        -0.0210      0.0165   -1.27   0.220   
 9 3     temp         0.276       1.14      0.243  0.811   
10 4     (Intercept) 73.3        15.1       4.86   0.000126
# ... with 62 more rows

Thanks for the patience, I am working through this myself!

like image 79
glenn_in_boston Avatar answered Mar 24 '23 22:03

glenn_in_boston


Issue would be that there is a grouping attribute rowwise after the do call and the column 'fitHour' is a list. We can ungroup, loop over the list with map and tidy it to a list column

library(dplyr)
library(purrr)
library(broom)
df.h %>% 
     group_by(hour) %>%
     do(fitHour = lm(price ~ wind + temp, data = .)) %>% 
     ungroup %>% 
     mutate(HourCoef = map(fitHour, tidy))

Or use unnest after the mtuate

df.h %>% 
      group_by(hour) %>%
      do(fitHour = lm(price ~ wind + temp, data = .)) %>% 
      ungroup %>% 
      transmute(hour, HourCoef = map(fitHour, tidy)) %>% 
      unnest(HourCoef)
# A tibble: 72 x 6
#   hour  term        estimate std.error statistic  p.value
#   <fct> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
# 1 1     (Intercept) 89.8       20.2       4.45   0.000308
# 2 1     wind         0.00493    0.0151    0.326  0.748   
# 3 1     temp        -1.84       1.08     -1.71   0.105   
# 4 2     (Intercept) 75.6       23.7       3.20   0.00500 
# 5 2     wind        -0.00910    0.0146   -0.622  0.542   
# 6 2     temp         0.192      0.853     0.225  0.824   
# 7 3     (Intercept) 44.0       23.9       1.84   0.0822  
# 8 3     wind        -0.00158    0.0166   -0.0953 0.925   
# 9 3     temp         0.622      1.19      0.520  0.609   
#10 4     (Intercept) 57.8       18.9       3.06   0.00676 
# … with 62 more rows

If we wanted a single dataset, pull the 'fitHour', loop over the list with map, condense it to a single dataset by row binding (suffix _dfr)

df.h %>%
    group_by(hour) %>%  
    do(fitHour = lm(price ~ wind + temp, data = .)) %>% 
    ungroup %>% 
    pull(fitHour) %>% 
    map_dfr(tidy, .id = 'grp')

NOTE: The OP's error message was able to be replicated with R 4.02, dplyr 1.0.0 and broom 0.7.0

tidy(dfHour,fitHour)

Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : Calling var(x) on a factor x is defunct. Use something like 'all(duplicated(x)[-1L])' to test for a constant vector. In addition: Warning messages: 1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom. 2: In mean.default(X[[i]], ...) :

like image 29
akrun Avatar answered Mar 24 '23 22:03

akrun