How do I build a dplyr summarize statement programmatically?

Tags: r, dplyr

I'm trying to do some dplyr programming and having trouble. I'd like to group_by an arbitrary number of variables (thus, across), and then summarize based on vectors of arbitrary (but equal) length that give:

The column to apply the function to
The function to apply
The name of the new column

So, like in a map or apply statement, I want to execute code that ends up looking like:

data %>%
  group_by(group_column) %>%
  summarize(new_name_1 = function_1(column_1),
            new_name_2 = function_2(column_2))

Here's an example of what I want and my best shot so far. I know I can use the .names argument to clean up the column names if I use across, but I'm not confident that across is the correct way. Finally, I'll be applying this to fairly large data frames, so I'd rather not calculate the extra columns.

Desired result

mtcars %>%
  group_by(across(c("cyl", "carb"))) %>%
  summarise(across(c("disp", "hp"), list(mean = mean, sd = sd))) %>%
  select(cyl, carb, disp_mean, hp_sd)
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 4
#> # Groups:   cyl [3]
#>     cyl  carb disp_mean hp_sd
#>   <dbl> <dbl>     <dbl> <dbl>
#> 1     4     1      91.4 16.1 
#> 2     4     2     117.  24.9 
#> 3     6     1     242.   3.54
#> 4     6     4     164.   7.51
#> 5     6     6     145   NA   
#> 6     8     2     346.  14.4 
#> 7     8     3     276.   0   
#> 8     8     4     406.  21.7 
#> 9     8     8     301   NA

What I get

mtcars %>%
  group_by(across(c("cyl", "carb"))) %>%
  summarise(across(c("disp", "hp"), list(mean = mean, sd = sd)))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 6
#> # Groups:   cyl [3]
#>     cyl  carb disp_mean disp_sd hp_mean hp_sd
#>   <dbl> <dbl>     <dbl>   <dbl>   <dbl> <dbl>
#> 1     4     1      91.4   21.4     77.4 16.1 
#> 2     4     2     117.    27.1     87   24.9 
#> 3     6     1     242.    23.3    108.   3.54
#> 4     6     4     164.     4.39   116.   7.51
#> 5     6     6     145     NA      175   NA   
#> 6     8     2     346.    43.4    162.  14.4 
#> 7     8     3     276.     0      180    0   
#> 8     8     4     406.    57.8    234   21.7 
#> 9     8     8     301     NA      335   NA

asked Aug 19 '21 19:08

spillway18

2 Answers

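With different functions on different columns, an option is to use collap from the collapse package. The custom argument takes a named list that maps each function name to the columns it should be applied to.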

library(collapse)
# disp is column 3 and hp is column 4 of mtcars
collap(mtcars, ~ cyl + carb, custom = list(fmean = 3, fsd = 4))

-output

  cyl   disp        hp carb
1   4  91.38 16.133815    1
2   4 116.60 24.859606    2
3   6 241.50  3.535534    1
4   6 163.80  7.505553    4
5   6 145.00        NA    6
6   8 345.50 14.433757    2
7   8 275.80  0.000000    3
8   8 405.50 21.725561    4
9   8 301.00        NA    8

Or the index can be dynamically generated with match

collap(mtcars, ~ cyl + carb, custom = list(fmean =
   match('disp', names(mtcars)), fsd = match('hp', names(mtcars))))

With tidyverse, an option is to loop over the column names of interest and the functions in map2 and do a join later

library(dplyr)
library(purrr)
library(stringr)
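# One grouped summary per (column, function) pair, then join on the grouping keys
map2(c("disp", "hp"), c("mean", "sd"), ~
  mtcars %>%
    group_by(across(c('cyl', 'carb'))) %>%
    summarise(across(all_of(.x), match.fun(.y),
      .names = str_c("{.col}_", .y)), .groups = 'drop')) %>%
  reduce(inner_join, by = c("cyl", "carb"))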

-output

# A tibble: 9 x 4
    cyl  carb disp_mean hp_sd
  <dbl> <dbl>     <dbl> <dbl>
1     4     1      91.4 16.1 
2     4     2     117.  24.9 
3     6     1     242.   3.54
4     6     4     164.   7.51
5     6     6     145   NA   
6     8     2     346.  14.4 
7     8     3     276.   0   
8     8     4     406.  21.7 
9     8     8     301   NA
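
The join can be avoided by building the summary expressions first and splicing them into a single summarise call. The following is a minimal sketch of that idea, not part of the answer above: it assumes the rlang package is installed, and the names cols, funs, names_out, summary_exprs and result are made up for illustration.

```r
library(dplyr)
library(purrr)
library(rlang)

# Three parallel vectors: column, function, and name of the new column
cols      <- c("disp", "hp")
funs      <- c("mean", "sd")
names_out <- c("disp_mean", "hp_sd")

# Build one unevaluated call per (function, column) pair, e.g. mean(disp)
summary_exprs <- map2(funs, cols, ~ call2(.x, sym(.y)))
names(summary_exprs) <- names_out

# Splice the named calls into summarise(); only the requested columns are computed
result <- mtcars %>%
  group_by(across(all_of(c("cyl", "carb")))) %>%
  summarise(!!!summary_exprs, .groups = "drop")

result
```

Because each call is evaluated once per group inside a single summarise, no extra columns are calculated and nothing has to be joined afterwards.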

answered Oct 24 '22 04:10

akrun

I have a package on GitHub, {dplyover}, which can help with this kind of task. In this case we could use over2 to loop over two character vectors simultaneously. The first vector contains the variable names as strings, which is why we have to wrap .x in sym() when applying a function to it. The second vector contains the function names, which we use as .y in a do.call. over2 creates the desired names automatically.

library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover

mtcars %>%
  group_by(across(c("cyl", "carb"))) %>%
  summarise(over2(c("disp", "hp"),
                  c("mean", "sd"),
                  ~ do.call(.y, list(sym(.x)))
                  ))

#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups:   cyl [3]
#>     cyl  carb disp_mean hp_sd
#>   <dbl> <dbl>     <dbl> <dbl>
#> 1     4     1      91.4 16.1 
#> 2     4     2     117.  24.9 
#> 3     6     1     242.   3.54
#> 4     6     4     164.   7.51
#> 5     6     6     145   NA   
#> 6     8     2     346.  14.4 
#> 7     8     3     276.   0   
#> 8     8     4     406.  21.7 
#> 9     8     8     301   NA

An alternative way building on the same logic is to use purrr::map2. However, here we have to put some effort into creating vectors with the desired names.

library(purrr)

# setup vectors and names
myfuns <- c("mean", "sd")
myvars <- c("disp", "hp") %>%
  set_names(., paste(., myfuns, sep = "_"))

mtcars %>%
  group_by(across(c("cyl", "carb"))) %>%
  summarise(map2(myvars,
                 myfuns,
                 ~ do.call(.y, list(sym(.x)))
                 ) %>% bind_cols()
  )

#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups:   cyl [3]
#>     cyl  carb disp_mean hp_sd
#>   <dbl> <dbl>     <dbl> <dbl>
#> 1     4     1      91.4 16.1 
#> 2     4     2     117.  24.9 
#> 3     6     1     242.   3.54
#> 4     6     4     164.   7.51
#> 5     6     6     145   NA   
#> 6     8     2     346.  14.4 
#> 7     8     3     276.   0   
#> 8     8     4     406.  21.7 
#> 9     8     8     301   NA

Created on 2021-08-20 by the reprex package (v2.0.1)
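
The map2 logic can also be wrapped in a small helper that takes the three vectors from the question (column, function, new name) directly. This is only a sketch under stated assumptions: summarise_pairs is a hypothetical helper, not part of dplyr or dplyover, and it uses match.fun() with the .data pronoun in place of do.call() and sym().

```r
library(dplyr)
library(purrr)

# Hypothetical helper: apply funs[i] to cols[i] within groups, naming the result new_names[i]
summarise_pairs <- function(data, groups, cols, funs, new_names) {
  stopifnot(length(cols) == length(funs), length(cols) == length(new_names))
  data %>%
    group_by(across(all_of(groups))) %>%
    summarise(
      # One value per pair for the current group, returned as a one-row tibble
      map2(cols, funs, ~ match.fun(.y)(.data[[.x]])) %>%
        set_names(new_names) %>%
        as_tibble(),
      .groups = "drop"
    )
}

out <- summarise_pairs(
  mtcars,
  groups    = c("cyl", "carb"),
  cols      = c("disp", "hp"),
  funs      = c("mean", "sd"),
  new_names = c("disp_mean", "hp_sd")
)

out
```

Since match.fun() accepts both function names and function objects, funs could also be given as a list such as list(mean, sd).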

answered Oct 24 '22 04:10

TimTeaFan
