
dplyr function with optional grouping only when argument provided

I need to write a dplyr function that creates a customised area plot. So here's my attempt.

area_plot <- function(data, what, by){
  by <- ensym(by)
  what <- ensym(what)

  data %>% 
    filter(!is.na(!!by)) %>% 
    group_by(date, !!by) %>% 
    summarise(!!what := sum(!!what, na.rm = TRUE)) %>% 
    complete(date, !!by, fill = rlang::list2(!!what := 0)) %>% 
    ggplot(aes(date, !!what, fill = !!by)) +
    geom_area(position = 'stack') +
    scale_x_date(breaks = '1 month', date_labels = '%Y-%m', expand = c(.01, .01)) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, vjust = .4)) +
    labs(fill = '')
}

But I've been wondering whether there is a default value for the by argument that would produce a geom_area plot for all groups together. I know I can use if to prepare the data for ggplot2 first and do something like this inside the function:

if (by != 'default') {
  data <- data %>%
    filter(!is.na(!!by)) %>%
    group_by(date, !!by) %>%
    summarise(!!what := sum(!!what, na.rm = TRUE)) %>%
    complete(date, !!by, fill = rlang::list2(!!what := 0))
}

ggplot(data, aes(date, !!what, fill = !!by)) +
  geom_area(position = 'stack') +
  scale_x_date(breaks = '1 month', date_labels = '%Y-%m', expand = c(.01, .01)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = .4)) +
  labs(fill = '')

But I wonder whether there is a neat trick: some value (e.g. a constant) passed to group_by that would make summarise preserve the original structure (so it effectively does nothing) despite being called. That would be similar to what happens when you supply a constant to an aesthetic in ggplot2.

Please see the sample of the data attached. group is an optional grouping variable.

structure(list(date = structure(c(17052, 17654, 17111, 17402, 
17090, 17765, 17181, 17301, 17496, 17051, 16980, 17155, 17599, 
16986, 17607, 17620, 17328, 17085, 17666, 17759, 17238, 16975, 
17242, 17322, 17625, 17598, 17124, 17648, 17675, 17613, 17044, 
16984, 16968, 17421, 17152, 17148, 17418, 17017, 17655, 17148, 
16981, 17644, 17149, 17090, 17548, 17474, 17564, 17530, 17237, 
17679, 17166, 17470, 17427, 17306, 17677, 17600, 17458, 17697, 
17602, 16990, 17111, 17150, 17561, 17406, 17135, 17181, 17014, 
17419, 17273, 17416, 17101, 17367, 17170, 17015, 17386, 17444, 
17507, 17592, 17058, 17292, 16966, 17756, 17239, 17479, 17260, 
17477, 16989, 17032, 17219, 17430, 17696, 17487, 17578, 17759, 
17269, 17634, 17279, 17478, 17222, 17296), class = "Date"), count = c(2, 
4, 2, 3, 6, 1, 4, 8, 1, 5, 1, 5, 1, 1, 2, 6, 3, 5, 2, 7, 3, 4, 
1, 3, 4, 2, 4, 1, 2, 3, 16, 1, 5, 4, 3, 4, 4, 6, 1, 3, 3, 1, 
3, 10, 5, 1, 4, 2, 2, 4, 5, 26, 4, 9, 3, 1, 3, 1, 4, 1, 2, 3, 
1, 13, 3, 1, 3, 1, 1, 3, 1, 3, 3, 4, 1, 2, 2, 3, 1, 9, 3, 1, 
2, 1, 4, 2, 1, 2, 4, 3, 2, 3, 1, 6, 5, 1, 2, 2, 3, 4), group = c("NON-FOOD", 
NA, NA, NA, NA, "MIX", NA, NA, "MIX", NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, "FOOD", NA, "FOOD", NA, NA, "MIX", 
NA, NA, NA, "FOOD", "FOOD", NA, NA, NA, NA, "FOOD", NA, NA, "FOOD", 
NA, NA, NA, "FOOD", NA, NA, NA, NA, NA, NA, NA, NA, "MIX", NA, 
NA, "FOOD", NA, "FOOD", NA, NA, "FOOD", NA, "FOOD", NA, NA, "NON-FOOD", 
NA, NA, "MIX", "NON-FOOD", NA, NA, NA, NA, NA, NA, "IMAGE", NA, 
"FOOD", NA, NA, NA, "FOOD", NA, "FOOD", NA, NA, NA, NA, NA, NA, 
NA, NA, "FOOD", "FOOD", NA, NA, NA)), row.names = c(73008L, 535553L, 
122359L, 321655L, 105632L, 646925L, 172409L, 256204L, 394666L, 
72385L, 20180L, 156162L, 478525L, 91409L, 485397L, 501386L, 277336L, 
100902L, 549629L, 640676L, 209400L, 16603L, 224543L, 272638L, 
505291L, 475497L, 131845L, 529041L, 558295L, 491746L, 67156L, 
23499L, 11150L, 334454L, 154958L, 150674L, 333348L, 45599L, 536064L, 
150673L, 20668L, 524095L, 151809L, 105713L, 433853L, 375687L, 
445626L, 420587L, 208594L, 562514L, 162403L, 372594L, 338509L, 
259784L, 560356L, 480072L, 361471L, 579474L, 481262L, 26469L, 
122119L, 152537L, 443426L, 325045L, 140531L, 171908L, 43547L, 
333968L, 237152L, 332106L, 114754L, 298081L, 164923L, 43577L, 
311250L, 350267L, 404348L, 470188L, 78329L, 250086L, 9486L, 638289L, 
209638L, 379370L, 227299L, 377487L, 26333L, 55058L, 195261L, 
340666L, 578515L, 387600L, 457752L, 640729L, 235389L, 514348L, 
240303L, 378836L, 197409L, 252746L), class = "data.frame")
Kuba_ asked Oct 16 '18 12:10



1 Answer

Here's one way to do the first few steps of your function (I didn't go into the ggplot part, only how you could approach the grouping). In general, to give an argument a default "do nothing" action, such as not grouping, set argument = NULL in the function definition. You can look at other functions' doc pages to see how this is done. Here's an SO post on the difference between NA and NULL.

I'm not super adept at working with quosures, but I've built a few functions and often rely on some rlang/tidyselect helper functions, such as rlang::quo_is_null that I'm using here. Someone else may be able to rewrite this without helpers.
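As a minimal sketch of that check in isolation (the helper name by_supplied is hypothetical, made up only for this illustration): enquo() captures the argument's expression as a quosure without evaluating it, and rlang::quo_is_null() reports whether that expression is NULL. That is the case both when by is omitted and when NULL is passed explicitly.

```r
library(rlang)

# Hypothetical helper, for illustration only:
# reports whether a column was supplied for `by`.
by_supplied <- function(by = NULL) {
  by_var <- enquo(by)    # capture the unevaluated argument as a quosure
  !quo_is_null(by_var)   # FALSE when the captured expression is NULL
}

by_supplied()        # argument omitted, so the default NULL is captured
#> [1] FALSE
by_supplied(NULL)    # explicit NULL
#> [1] FALSE
by_supplied(group)   # bare column name; never evaluated, so no error
#> [1] TRUE
```

Because the argument is only captured and never evaluated, a bare column name such as group does not need to exist as an object in the calling environment.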

First, here is the behavior you're looking for, as grouped and ungrouped summaries:

library(tidyverse)

# df is assumed to hold the sample data frame posted in the question

# grouped
df %>%
  filter(!is.na(group)) %>%
  group_by(group) %>%
  summarise(count = sum(count, na.rm = TRUE))
#> # A tibble: 4 x 2
#>   group    count
#>   <chr>    <dbl>
#> 1 FOOD        34
#> 2 IMAGE        1
#> 3 MIX          8
#> 4 NON-FOOD     6

# not grouped
df %>%
  # add a filter() step here if you want to filter the ungrouped data
  summarise(count = sum(count, na.rm = TRUE))
#>   count
#> 1   347

Then in the function, I create what_var as the quosure version of what (rlang experts, feel free to correct me on this terminology...?). I generally add _var to names to keep track of what's the original argument and what's already been enquoed. To check whether by was supplied, create a quosure of by and test whether that quosure is null. If it's not null, i.e. if some column name was supplied for by, filter and group by that quosure. If it is null, just pass along the original data frame. In both branches I assign the result to a new variable, so the original data frame is never modified. Then, regardless of whether the data is grouped, summarize what.

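to_group_or_not_to_group <- function(data, what, by = NULL) {
  # capture both arguments as quosures without evaluating them
  what_var <- enquo(what)
  by_var <- enquo(by)

  if (!rlang::quo_is_null(by_var)) {
    # a grouping column was supplied: drop its NAs, then group by it
    grouped_or_not <- data %>%
      filter(!is.na(!!by_var)) %>%
      group_by(!!by_var)
  } else {
    # no grouping column: pass the data along unchanged
    grouped_or_not <- data
  }

  # summarise `what` under its original column name, grouped or not
  grouped_or_not %>%
    summarise(!!quo_name(what_var) := sum(!!what_var, na.rm = TRUE))
}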

Verify that you get the intended results. With a grouping variable:

df %>%
  to_group_or_not_to_group(what = count, by = group)
#> # A tibble: 4 x 2
#>   group    count
#>   <chr>    <dbl>
#> 1 FOOD        34
#> 2 IMAGE        1
#> 3 MIX          8
#> 4 NON-FOOD     6

Supplying NULL as the (absence of) grouping variable:

df %>%
  to_group_or_not_to_group(what = count, by = NULL)
#>   count
#> 1   347

Without a grouping variable, falling back on the default by = NULL:

df %>%
  to_group_or_not_to_group(what = count)
#>   count
#> 1   347

Created on 2018-10-16 by the reprex package (v0.2.1)
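
Carrying the same pattern through to the plot from the question could look roughly like the sketch below. This is an adaptation under stated assumptions, not part of the reprex above. The ungroup() call is there on the assumption that recent tidyr versions refuse to complete() a column that is still a grouping variable. date_breaks is used in place of breaks because it is the documented way to pass a string such as "1 month". The ungrouped branch simply totals what per date with no filtering, which is one guess at what "all groups together" should mean.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Sketch: area plot with an optional grouping column `by`.
# Assumes `data` has a Date column literally named `date`.
area_plot <- function(data, what, by = NULL) {
  what_var <- enquo(what)
  by_var <- enquo(by)
  what_name <- rlang::as_name(what_var)

  if (rlang::quo_is_null(by_var)) {
    # no grouping column: one total per date, drawn as a single area
    plot_data <- data %>%
      group_by(date) %>%
      summarise(!!what_name := sum(!!what_var, na.rm = TRUE))

    p <- ggplot(plot_data, aes(date, !!what_var)) +
      geom_area()
  } else {
    # grouping column supplied: total per date and group,
    # with missing date/group combinations filled in as 0
    plot_data <- data %>%
      filter(!is.na(!!by_var)) %>%
      group_by(date, !!by_var) %>%
      summarise(!!what_name := sum(!!what_var, na.rm = TRUE)) %>%
      ungroup() %>%
      complete(date, !!by_var, fill = rlang::list2(!!what_name := 0))

    p <- ggplot(plot_data, aes(date, !!what_var, fill = !!by_var)) +
      geom_area(position = "stack")
  }

  # styling shared by both branches, taken from the question
  p +
    scale_x_date(date_breaks = "1 month", date_labels = "%Y-%m",
                 expand = c(.01, .01)) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, vjust = .4)) +
    labs(fill = "")
}
```

With the sample data in df, area_plot(df, count) would draw one area for all rows, and area_plot(df, count, group) would draw the stacked version. Keeping the shared scale and theme calls outside the if avoids repeating them in each branch.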

camille answered Nov 09 '22 02:11
