How do I build a dplyr summarize statement programmatically?

Tags: r, dplyr

I'm trying to do some programming with dplyr and am having trouble. I'd like to group_by an arbitrary number of variables (hence across), and then summarize based on vectors of arbitrary (but equal) length that give:

  • The column to apply the function to
  • The function to apply
  • The name of the new column

So, like in a map or apply statement, I want to execute code that ends up looking like:

data %>%
  group_by(group_column) %>%
  summarize(new_name_1 = function_1(column_1),
            new_name_2 = function_2(column_2))

Here's an example of what I want and my best shot so far. I know I can use the .names argument to clean up the column names if I use across, but I'm not confident that across is the right approach. Finally, I'll be applying this to fairly large data frames, so I'd rather not compute the extra columns.

Desired result

mtcars %>%
  group_by(across(c("cyl", "carb"))) %>%
  summarise(across(c("disp", "hp"), list(mean = mean, sd = sd))) %>%
  select(cyl, carb, disp_mean, hp_sd)
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 4
#> # Groups:   cyl [3]
#>     cyl  carb disp_mean hp_sd
#>   <dbl> <dbl>     <dbl> <dbl>
#> 1     4     1      91.4 16.1 
#> 2     4     2     117.  24.9 
#> 3     6     1     242.   3.54
#> 4     6     4     164.   7.51
#> 5     6     6     145   NA   
#> 6     8     2     346.  14.4 
#> 7     8     3     276.   0   
#> 8     8     4     406.  21.7 
#> 9     8     8     301   NA

What I get

mtcars %>%
  group_by(across(c("cyl", "carb"))) %>%
  summarise(across(c("disp", "hp"), list(mean = mean, sd = sd)))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 6
#> # Groups:   cyl [3]
#>     cyl  carb disp_mean disp_sd hp_mean hp_sd
#>   <dbl> <dbl>     <dbl>   <dbl>   <dbl> <dbl>
#> 1     4     1      91.4   21.4     77.4 16.1 
#> 2     4     2     117.    27.1     87   24.9 
#> 3     6     1     242.    23.3    108.   3.54
#> 4     6     4     164.     4.39   116.   7.51
#> 5     6     6     145     NA      175   NA   
#> 6     8     2     346.    43.4    162.  14.4 
#> 7     8     3     276.     0      180    0   
#> 8     8     4     406.    57.8    234   21.7 
#> 9     8     8     301     NA      335   NA
asked Aug 19 '21 19:08 by spillway18


2 Answers

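To apply different functions to different columns, one option is collap() from the collapse package. Its custom argument maps each function name to the columns it should be applied to, given here as column positions.

library(collapse)
# in mtcars, disp is column 3 and hp is column 4
collap(mtcars, ~ cyl + carb, custom = list(fmean = 3, fsd = 4))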

-output

  cyl   disp        hp carb
1   4  91.38 16.133815    1
2   4 116.60 24.859606    2
3   6 241.50  3.535534    1
4   6 163.80  7.505553    4
5   6 145.00        NA    6
6   8 345.50 14.433757    2
7   8 275.80  0.000000    3
8   8 405.50 21.725561    4
9   8 301.00        NA    8

Or the index can be generated dynamically with match:

collap(mtcars, ~ cyl + carb,
       custom = list(fmean = match('disp', names(mtcars)),
                     fsd   = match('hp', names(mtcars))))
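
To make the positions concrete, here is a small base R sketch of what match() resolves to, assuming the stock, unmodified mtcars data set. The name custom_spec is only illustrative; it shows the shape of a list that could be passed as custom when the column and function names live in vectors.

```r
# Column names and collapse function names held in parallel vectors
# (illustrative values taken from the example above).
target_cols <- c("disp", "hp")
fun_names   <- c("fmean", "fsd")

# match() returns the position of each name within names(mtcars).
idx <- match(target_cols, names(mtcars))
idx
# 3 4

# A named list of positions, one entry per function name.
custom_spec <- setNames(as.list(idx), fun_names)
```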

With the tidyverse, an option is to loop over the column names and the function names in parallel with map2, summarise each pair separately, and join the results afterwards

library(dplyr)
library(purrr)
library(stringr)
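map2(c("disp", "hp"), c("mean", "sd"), ~
   mtcars %>%
      group_by(across(c('cyl', 'carb'))) %>%
      # .x is the column name, .y the function name
      summarise(across(all_of(.x), match.fun(.y),
         .names = str_c("{.col}_", .y)), .groups = 'drop')) %>%
    # join the per-pair summaries on the grouping columns
    reduce(inner_join, by = c('cyl', 'carb'))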

-output

# A tibble: 9 x 4
    cyl  carb disp_mean hp_sd
  <dbl> <dbl>     <dbl> <dbl>
1     4     1      91.4 16.1 
2     4     2     117.  24.9 
3     6     1     242.   3.54
4     6     4     164.   7.51
5     6     6     145   NA   
6     8     2     346.  14.4 
7     8     3     276.   0   
8     8     4     406.  21.7 
9     8     8     301   NA   
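
Another possibility, sketched here rather than taken from the answer above, is to build the summarise() arguments as unevaluated calls with rlang and splice them in with !!!. This computes only the requested columns in a single pass, without a join. The helper name summarise_pairs and its argument names are hypothetical.

```r
library(dplyr)
library(purrr)
library(rlang)

# Hypothetical helper: one summary column per (column, function, name) triple.
summarise_pairs <- function(data, group_cols, cols, funs, new_names) {
  stopifnot(length(cols) == length(funs), length(cols) == length(new_names))

  # Build unevaluated calls such as mean(disp) and sd(hp),
  # named by the output column they should produce.
  summary_exprs <- map2(funs, cols, ~ call2(.x, sym(.y))) %>%
    set_names(new_names)

  data %>%
    group_by(across(all_of(group_cols))) %>%
    summarise(!!!summary_exprs, .groups = "drop")
}

result <- summarise_pairs(
  mtcars,
  group_cols = c("cyl", "carb"),
  cols       = c("disp", "hp"),
  funs       = c("mean", "sd"),
  new_names  = c("disp_mean", "hp_sd")
)
```

Here result should have the columns cyl, carb, disp_mean and hp_sd, with one row per cyl/carb combination.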
answered Oct 24 '22 04:10 by akrun


I have a package on GitHub, {dplyover}, which can help with this kind of task. In this case we can use over2 to loop over two character vectors simultaneously. The first vector contains the variable names as strings, which is why we have to wrap .x in sym() when applying a function to it. The second vector contains the function names, which we pass as .y to do.call. over2 creates the desired names automatically.

library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover

mtcars %>%
  group_by(across(c("cyl", "carb"))) %>%
  summarise(over2(c("disp", "hp"),
                  c("mean", "sd"),
                  ~ do.call(.y, list(sym(.x)))
                  ))

#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups:   cyl [3]
#>     cyl  carb disp_mean hp_sd
#>   <dbl> <dbl>     <dbl> <dbl>
#> 1     4     1      91.4 16.1 
#> 2     4     2     117.  24.9 
#> 3     6     1     242.   3.54
#> 4     6     4     164.   7.51
#> 5     6     6     145   NA   
#> 6     8     2     346.  14.4 
#> 7     8     3     276.   0   
#> 8     8     4     406.  21.7 
#> 9     8     8     301   NA
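
To show why the sym() wrapper matters, here is a minimal stand-alone sketch outside of dplyr. The vector disp is an illustrative stand-in for a data-frame column: do.call() places a symbol into the call unevaluated, so the call becomes mean(disp), whereas a plain string would be handed to mean() as character data.

```r
library(rlang)

# Illustrative vector, standing in for a data-frame column.
disp <- c(10, 20, 60)

# A bare string is passed to mean() as character data, which yields NA
# (with a warning, suppressed here).
as_string <- suppressWarnings(do.call("mean", list("disp")))

# A symbol is inserted into the call unevaluated, giving mean(disp).
as_symbol <- do.call("mean", list(sym("disp")))
as_symbol
# 30
```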

An alternative that builds on the same logic is purrr::map2. Here, however, we have to name the input vector ourselves to get the desired column names.

library(purrr)

# setup vectors and names
myfuns <- c("mean", "sd")
myvars <- c("disp", "hp") %>%
  set_names(., paste(., myfuns, sep = "_"))

mtcars %>%
  group_by(across(c("cyl", "carb"))) %>%
  summarise(map2(myvars,
                 myfuns,
                 ~ do.call(.y, list(sym(.x)))
                 ) %>% bind_cols()
  )

#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups:   cyl [3]
#>     cyl  carb disp_mean hp_sd
#>   <dbl> <dbl>     <dbl> <dbl>
#> 1     4     1      91.4 16.1 
#> 2     4     2     117.  24.9 
#> 3     6     1     242.   3.54
#> 4     6     4     164.   7.51
#> 5     6     6     145   NA   
#> 6     8     2     346.  14.4 
#> 7     8     3     276.   0   
#> 8     8     4     406.  21.7 
#> 9     8     8     301   NA

Created on 2021-08-20 by the reprex package (v2.0.1)

answered Oct 24 '22 04:10 by TimTeaFan