I'm trying to do some dplyr programming and having trouble. I'd like to group_by
an arbitrary number of variables (thus, across
), and then summarize
based on arbitrary length (but all the same length) vectors of:
So, like in a map
or apply
statement, I want to execute code that ends up looking like:
data %>%
group_by(group_column) %>%
summarize(new_name_1 = function_1(column_1),
summarize(new_name_2 = function_2(column_2))
Here's an example of what I want and my best shot so far. I know I can use the names
argument to clean those up if I use across, but I'm not confident that across is the correct way. Finally, I'll be applying this to fairly large dataframes, so I'd rather not calculate the extra columns.
Desired result
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(across(c("disp", "hp"), list(mean = mean, sd = sd))) %>%
select(cyl, carb, disp_mean, hp_sd)
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
What I get
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(across(c("disp", "hp"), list(mean = mean, sd = sd)))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 6
#> # Groups: cyl [3]
#> cyl carb disp_mean disp_sd hp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 21.4 77.4 16.1
#> 2 4 2 117. 27.1 87 24.9
#> 3 6 1 242. 23.3 108. 3.54
#> 4 6 4 164. 4.39 116. 7.51
#> 5 6 6 145 NA 175 NA
#> 6 8 2 346. 43.4 162. 14.4
#> 7 8 3 276. 0 180 0
#> 8 8 4 406. 57.8 234 21.7
#> 9 8 8 301 NA 335 NA
summarise() creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input.
The function n() returns the number of observations in a current group.
Summarize Function in R Programming. As its name implies, the summarize function reduces a data frame to a summary of just one vector or value. Many times, these summaries are calculated by grouping observations using a factor or categorical variables first.
The dot is used within dplyr mainly (not exclusively) in mutate_each , summarise_each and do . In the first two (and their SE counterparts) it refers to all the columns to which the functions in funs are applied. In do it refers to the (potentially grouped) data. frame so you can reference single columns by using .
Basic dplyr Summarize We can use the basic summarize method by passing the data as the first parameter and the named parameter with a summary method. For example, below we pass the mean parameter to create a new column and we pass the mean () function call on the column we would like to summarize. This would add the mean of disp.
For example, below we pass the mean parameter to create a new column and we pass the mean () function call on the column we would like to summarize. This would add the mean of disp. When working with dplyr and the tidyverse, we often use the pipe, %>% operator. With this, we can send the data set to our method to use.
Much work with data involvces subsetting, defining new columns, sorting or otherwise manipulating the data. dplyr has five functions (verbs) for such actions, that all start with a data.frame or tbl_df and produce another one.
We can use the basic summarize method by passing the data as the first parameter and the named parameter with a summary method. For example, below we pass the mean parameter to create a new column and we pass the mean () function call on the column we would like to summarize. This would add the mean of disp. summarize(mtcars, mean = mean(disp))
With different functions on different columns, an option is to use collap
from collapse
library(collapse)
collap(mtcars, ~ cyl + carb, custom = list(fmean = 4, fsd = 5))
-output
cyl disp hp carb
1 4 91.38 16.133815 1
2 4 116.60 24.859606 2
3 6 241.50 3.535534 1
4 6 163.80 7.505553 4
5 6 145.00 NA 6
6 8 345.50 14.433757 2
7 8 275.80 0.000000 3
8 8 405.50 21.725561 4
9 8 301.00 NA 8
Or the index can be dynamically generated with match
collap(mtcars, ~ cyl + carb, custom = list(fmean =
match('disp', names(mtcars)), fsd = match('hp', names(mtcars))))
With tidyverse
, an option is to loop over the column names of interest and the functions in map2
and do a join later
library(dplyr)
library(purrr)
library(stringr)
map2(c("disp", "hp"), c("mean", "sd"), ~
mtcars %>%
group_by(across(c('cyl', 'carb'))) %>%
summarise(across(all_of(.x), match.fun(.y),
.names = str_c("{.col}_", .y)), .groups = 'drop')) %>%
reduce(inner_join)
-output
# A tibble: 9 x 4
cyl carb disp_mean hp_sd
<dbl> <dbl> <dbl> <dbl>
1 4 1 91.4 16.1
2 4 2 117. 24.9
3 6 1 242. 3.54
4 6 4 164. 7.51
5 6 6 145 NA
6 8 2 346. 14.4
7 8 3 276. 0
8 8 4 406. 21.7
9 8 8 301 NA
I have a package on github {dplyover}
which can help with this kind of tasks. In this case we could use over2
to
loop over two character vectors simultaniously. The first vector contains the
variable names as string, which is why we have to wrap .x
in sym()
when
applying a function to it. The second vector contains the function names,
which we use as .y
in a do.call
. over2
creates the desired names automatically.
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(over2(c("disp", "hp"),
c("mean", "sd"),
~ do.call(.y, list(sym(.x)))
))
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
An alternative way building on the same logic is to use purrr::map2
. However,
here we have to put some effort into creating vectors with the desired names.
library(purrr)
# setup vectors and names
myfuns <- c("mean", "sd")
myvars <- c("disp", "hp") %>%
set_names(., paste(., myfuns, sep = "_"))
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(map2(myvars,
myfuns,
~ do.call(.y, list(sym(.x)))
) %>% bind_cols()
)
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
Created on 2021-08-20 by the reprex package (v2.0.1)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With