Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Function similar to group_by when groups are not mutually exlcusive

Tags:

r

group-by

dplyr

I would like to create a function in R, similar to dplyr's group_by function, that when combined with summarise can give summary statistics for a dataset where group membership is not mutually exclusive. I.e., observations can belong to multiple groups. One way to think about it might be to consider tags; observations may belong to one or more tags which might overlap.

For example, take R's esoph dataset (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/esoph.html) documenting a case-control study of esophageal cancer. Suppose I'm interested in the number and proportion of cancer cases overall and per 'tag', where the tags are: 65+ years old; 80+ gm/day alcohol; 20+ gm/day tobacco; and a 'high risk' group where the previous 3 criteria are met. Let's transform the dataset to long format (one participant per row) and then add these tags (logical columns) to the dataset:

library('dplyr')
data(esoph)
esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
                      esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
            ) %>% 
            mutate(highage=(agegp %in% c('65-74','75+')),
                   highalc=(alcgp %in% c('80-119','120+')),
                   hightob=(tobgp %in% c('20-29','30+')),
                   highrisk=(highage & highalc & hightob)
            )

My usual approach is to create a dataset where each observation is duplicated for every tag it belongs to, and then summarise this dataset:

esophdup = bind_rows(esophlong %>% filter(highage) %>% mutate(tag='age>=65'),
                     esophlong %>% filter(highalc) %>% mutate(tag='alc>=80'),
                     esophlong %>% filter(hightob) %>% mutate(tag='tob>=20'),
                     esophlong %>% filter(highrisk) %>% mutate(tag='high risk'),
                     esophlong %>% filter() %>% mutate(tag='all')
           ) %>%
           mutate(tag=factor(tag, levels = unique(.$tag)))

summary = esophdup %>%
          group_by(tag) %>%
          summarise(n=n(), ncases=sum(case), case.rate=mean(case))

This approach is inefficient for large datasets or for a large number of tags and I will often run out of memory to store it.

An alternative is to summarise each tag separately and then bind these summary datasets afterwards, as follows:

summary.age = esophlong %>%
              filter(highage) %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='age>=65')

summary.alc = esophlong %>%
              filter(highalc) %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='alc>=80')

summary.tob = esophlong %>%
              filter(hightob) %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='tob>=20')

summary.highrisk = esophlong %>%
              filter(highrisk) %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='high risk')

summary.all = esophlong %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='all')

summary=bind_rows(summary.age,summary.alc,summary.tob,summary.highrisk,summary.all)  

This approach is time-consuming and tedious when I have a large number of tags or I want to reuse the tags often for different summary measures throughout a project.

The function I have in mind, say group_by_tags(data, key, ...), which includes an argument to specify the name of the grouping column, should work something like this:

summary = esophlong %>% 
          group_by_tags(key='tags',
                        'age>=65'=highage,
                        'alc>=80'=highalc,
                        'tob>=20'=hightob,
                        'high risk'=highrisk,
                        'all ages'=1
          ) %>%
          summarise(n=n(), ncases=sum(case), case.rate=mean(case))

with the summary dataset looking like this:

> summary
       tags     n ncases case.rate
1   age>=65   273     68 0.2490842
2   alc>=80   301     96 0.3189369
3   tob>=20   278     64 0.2302158
4 high risk    11      5 0.4545455
5       all  1175    200 0.1702128

Even better, it could take variables of type "factor" as well as "logical" so that it could summarise, say, each age group individually, the 65+ year olds, and everybody:

summaryage = esophlong %>% 
          group_by_tags(key='Age.group',
                        agegp,
                        '65+'=(agegp %in% c('65-74','75+')),
                        'all'=1                 
          ) %>%
          summarise(n=n(), ncases=sum(case), case.rate=mean(case))

>summaryage
  Age.group     n ncases case.rate
1     25-34   117      1 0.0085470
2     35-44   208      9 0.0432692
3     45-54   259     46 0.1776062
4     55-64   318     76 0.2389937
5     65-74   216     55 0.2546296
6       75+    57     13 0.2280702
7       65+   273     68 0.2490842
8       all  1175    200 0.1702128

Perhaps it's not possible with ... and instead you might need to pass a vector/list of column names for the tags.

Any ideas?

EDIT: to be clear, the solution should take tag/group definitions and the required summary statistics as arguments, rather than being built into the function itself. Either as a two-step data %>% group_by_tags(tags) %>% summarise_tags(stats) or a one-step data %>% summary_tags(tags,stats) process.

like image 579
wjchulme Avatar asked Aug 23 '16 16:08

wjchulme


4 Answers

This is a variation on @eddi's answer. I am taking the definitions of highage et al as part of the function's job:

library(data.table)
custom_summary = function(DT, tags, stats){
    setDT(DT)
    rows = stack(lapply(tags[-1], function(x) DT[eval(x), which=TRUE]))
    DT[rows$values, eval(stats), by=.(tag = rows$ind)]
}

And some example usage:

data(esoph)
library(dplyr)
esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
                      esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
            )

custom_summary(
    DT = esophlong, 
    tags = quote(list(
        'age>=65'   = agegp %in% c('65-74','75+'),
        'alc>=80'   = alcgp %in% c('80-119','120+'),
        'tob>=20'   = tobgp %in% c('20-29','30+'),
        'high risk' = eval(substitute(`age>=65` & `alc>=80` & `tob>=20`, as.list(tags))),
        'all ages'  = TRUE
    )),
    stats = quote(list(
        n           = .N, 
        n_cases     = sum(case), 
        case.rate   = mean(case)
    ))
)

         tag    n n_cases case.rate
1:   age>=65  273      68 0.2490842
2:   alc>=80  301      96 0.3189369
3:   tob>=20  278      64 0.2302158
4: high risk   11       5 0.4545455
5:  all ages 1175     200 0.1702128

The technique of using eval inside DT[...] is explained in the data.table FAQ.

like image 159
Frank Avatar answered Nov 15 '22 08:11

Frank


Not a completely functional answer, more "WIP" or start for discourse. This should ultimately go in a repo and either an additional package or a PR for dplyr.

One way is to mimic the attributes' structure from a "normally" grouped variable:

library(dplyr)
esoph %>% group_by(agegp, alcgp) %>% attributes %>% str
# List of 9
#  $ names             : chr [1:5] "agegp" "alcgp" "tobgp" "ncases" ...
#  $ row.names         : int [1:88] 1 2 3 4 5 6 7 8 9 10 ...
#  $ class             : chr [1:4] "grouped_df" "tbl_df" "tbl" "data.frame"
#  $ vars              :List of 2
#   ..$ : symbol agegp
#   ..$ : symbol alcgp
#  $ drop              : logi TRUE
#  $ indices           :List of 24
#   ..$ : int [1:4] 0 1 2 3
#   ..$ : int [1:4] 4 5 6 7
#   ..$ : int [1:3] 8 9 10
#   ...........
#  $ group_sizes       : int [1:24] 4 4 3 4 4 4 4 3 4 4 ...
#  $ biggest_group_size: int 4
#  $ labels            :'data.frame':   24 obs. of  2 variables:
#   ..$ agegp: Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 2 2 2 2 3 3 ...
#   ..$ alcgp: Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 2 3 4 1 2 3 4 1 2 ...
#   ..- attr(*, "vars")=List of 2
#   .. ..$ : symbol agegp
#   .. ..$ : symbol alcgp
#   ..- attr(*, "drop")= logi TRUE

We can reproduce this artificially to see if/how it might work:

esoph2 <- esoph
syms <- list(as.symbol("agegp65"), as.symbol("alcgp80"))
attr(esoph2, "vars") <- syms
attr(esoph2, "drop") <- TRUE
# 'agegp' and 'aclgp' are ordered factors, for simplicity here just using ints
# `group_by` indices are 0-based
indices <- list(
  which(as.integer(esoph2$agegp) >= 5) - 1,
  which(as.integer(esoph2$alcgp) >= 3) - 1
)
attr(esoph2, "indices") <- indices
attr(esoph2, "group_sizes") <- lengths(indices)
attr(esoph2, "biggest_group_size") <- max(lengths(indices))
df <- data.frame(agegp65 = "agegp >= 65", alcgp80 = "alcgp >= 80", stringsAsFactors = FALSE)
attr(df, "vars") <- syms
attr(esoph2, "labels") <- df
class(esoph2) <- c("grouped_df", "tbl_df", "tbl", "data.frame")

Which "looks" like a normal grouped data.frame:

str(esoph2)
# Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':   88 obs. of  5 variables:
#  $ agegp    : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
#  $ alcgp    : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
#  $ tobgp    : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
#  $ ncases   : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ ncontrols: num  40 10 6 5 27 7 4 7 2 1 ...
#  - attr(*, "vars")=List of 2
#   ..$ : symbol agegp65
#   ..$ : symbol alcgp80
#  - attr(*, "drop")= logi TRUE
#  - attr(*, "indices")=List of 2
#   ..$ : num  62 63 64 65 66 67 68 69 70 71 ...
#   ..$ : num  8 9 10 11 12 13 14 23 24 25 ...
#  - attr(*, "group_sizes")= int  26 42
#  - attr(*, "biggest_group_size")= int 42
#  - attr(*, "labels")='data.frame':    1 obs. of  2 variables:
#   ..$ agegp65: chr "agegp >= 65"
#   ..$ alcgp80: chr "alcgp >= 80"
#   ..- attr(*, "vars")=List of 2
#   .. ..$ : symbol agegp65
#   .. ..$ : symbol alcgp80
esoph2
# Source: local data frame [88 x 5]
# Groups: agegp65, alcgp80 [2]
#    agegp     alcgp    tobgp ncases ncontrols
#    <ord>     <ord>    <ord>  <dbl>     <dbl>
# 1  25-34 0-39g/day 0-9g/day      0        40
# 2  25-34 0-39g/day    10-19      0        10
# 3  25-34 0-39g/day    20-29      0         6
# 4  25-34 0-39g/day      30+      0         5
# 5  25-34     40-79 0-9g/day      0        27
# 6  25-34     40-79    10-19      0         7
# 7  25-34     40-79    20-29      0         4
# 8  25-34     40-79      30+      0         7
# 9  25-34    80-119 0-9g/day      0         2
# 10 25-34    80-119    10-19      0         1
# # ... with 78 more rows

Unfortunately:

esoph2 %>% summarize(n = n())
# Error: corrupt 'grouped_df', contains 88 rows, and 68 rows in groups

Ergo my comment that summarize assumes full coverage; you'd have to modify dplyr_summarise_impl (in C++), perhaps making a third option to summarise_grouped and summarise_not_grouped.

like image 22
r2evans Avatar answered Nov 15 '22 10:11

r2evans


library(data.table)
setDT(esophlong)

special.summary = function(dt, vars) {
  rbindlist(lapply(seq_along(vars), function(i) {
      var = vars[[i]]
      if (is.logical(dt[, eval(var)])) {
        dt[eval(var) == TRUE, .(.N, sum(case), mean(case))][, tag := names(vars)[i]][
           , .SD, by = tag] # last step is a lazy version of setcolorder
      } else {
        dt[, .(.N, sum(case), mean(case)), by = .(tag = eval(var))]
      }
    }))
}

special.summary(esophlong, list('age>=65'=quote(highage),
                                'alc>=80'=quote(highalc),
                                'tob>=20'=quote(hightob),
                                'high risk'=quote(highrisk),
                                'all'=quote(TRUE)))

#         tag    N  V2        V3
#1:   age>=65  273  68 0.2490842
#2:   alc>=80  301  96 0.3189369
#3:   tob>=20  278  64 0.2302158
#4: high risk   11   5 0.4545455
#5:       all 1175 200 0.1702128

special.summary(esophlong, list(quote(agegp),
                                '65+'=quote(agegp %in% c('65-74','75+')),
                                'all'=quote(TRUE)))

#     tag    N  V2          V3
#1: 25-34  117   1 0.008547009
#2: 35-44  208   9 0.043269231
#3: 45-54  259  46 0.177606178
#4: 55-64  318  76 0.238993711
#5: 65-74  216  55 0.254629630
#6:   75+   57  13 0.228070175
#7:   65+  273  68 0.249084249
#8:   all 1175 200 0.170212766

This can be made more customizable of course, and that's left as an exercise to the reader.

like image 24
eddi Avatar answered Nov 15 '22 10:11

eddi


In the absence of any knowledge of tidyverse internals, I avoided trying to create group_by()-type function whose output should be passed to summarise() and instead made a single function that combines both (similar to other answers but, I hope, more user-friendly and generalisable).

Since group_by() %>% summarise() returns the joint summary information for each nested combination of grouping variables, I chose the name summarise_marginal() since it will return the marginal summary information for each grouping variable independently.

Solution that doesn't work with grouped_df objects

First, a solution that doesn't work with grouped_df classes, but is extended below:

summarise_marginal0 <- function(.tbl, .vars, ..., .removeF=FALSE){

  dots <- quos(...)

  .tbl %>% 
    transmute(!!! .vars) %>% 
    map_dfr(
      ~ summarise(group_by(.tbl, 'value'=., add = TRUE), !!! dots) %>%  # piping .tbl %>% group_by() %>% summarise() evaluates in the wrong order for some reason
      filter_at(vars('value'), all_vars(!(.==FALSE & .removeF))) %>%  # to remove rows where a logical group is FALSE.
      mutate_at(vars('value'), as.character)  # standardises 'value' column in case map_dfr tries to convert logical to factor
      , .id='group'
    )
}


mtcars %>% 
  summarise_marginal0(
    vars(cyl, am),
    meanmpg = mean(mpg),
    meanwt = mean(wt)
  )

#> # A tibble: 5 x 4
#>   group value  meanmpg   meanwt
#>   <chr> <chr>    <dbl>    <dbl>
#> 1   cyl     4 26.66364 2.285727
#> 2   cyl     6 19.74286 3.117143
#> 3   cyl     8 15.10000 3.999214
#> 4    am     0 17.14737 3.768895
#> 5    am     1 24.39231 2.411000

Capturing groups with vars() (like with summarise_at() or mutate_at()) neatly separates the groups from the summary functions, and allows new groups to be created on-the-fly:

mtcars %>% 
  summarise_marginal0(
    vars(cyl, hp_lt100 = hp<100),
    meanmpg = mean(mpg),
    meanwt = mean(wt)
  )

#> # A tibble: 5 x 4
#>      group value  meanmpg   meanwt
#>      <chr> <chr>    <dbl>    <dbl>
#> 1      cyl     4 26.66364 2.285727
#> 2      cyl     6 19.74286 3.117143
#> 3      cyl     8 15.10000 3.999214
#> 4 hp_lt100 FALSE 17.45217 3.569652
#> 5 hp_lt100  TRUE 26.83333 2.316667

We can use the .removeF argument to remove FALSE logical values. Useful if you want to summarise certain rows but not their compliment:

mtcars %>% 
  summarise_marginal0(
    vars(cyl==6, hp_lt100 = hp<100, hp_lt200 = hp<200),
    meanmpg = mean(mpg),
    meanwt = mean(wt),
    .removeF = TRUE
  )

#> # A tibble: 3 x 4
#>      group value  meanmpg   meanwt
#>      <chr> <chr>    <dbl>    <dbl>
#> 1 cyl == 6  TRUE 19.74286 3.117143
#> 2 hp_lt100  TRUE 26.83333 2.316667
#> 3 hp_lt200  TRUE 21.96000 2.911320

Notice the how even without explicitly naming the cyl == 6 group we still get a useful name for it.

Solution that does work with grouped_df objects

summarise_marginal0() can be extended to work with grouped_df objects returned by group_by():

summarise_marginal <- function(.tbl, .vars, ...){

  dots <- quos(...)

  .tbl %>%
    nest() %>%
    mutate(
      summarised = map(data, ~summarise_marginal0(., .vars, !!! dots))
    ) %>% 
    unnest(summarised) %>%
    purrrlyr::slice_rows(group_vars(.tbl))
}


mtcars %>% 
  group_by(am) %>%
  summarise_marginal(
    vars(cyl, hp_lt100 = hp<100),
    meanmpg = mean(mpg),
    meanwt = mean(wt)
  )

#> # A tibble: 10 x 5
#> # Groups:   am [2]
#>       am    group value  meanmpg   meanwt
#>    <dbl>    <chr> <chr>    <dbl>    <dbl>
#>  1     1      cyl     4 28.07500 2.042250
#>  2     1      cyl     6 20.56667 2.755000
#>  3     1      cyl     8 15.40000 3.370000
#>  4     1 hp_lt100 FALSE 20.61429 2.756857
#>  5     1 hp_lt100  TRUE 28.80000 2.007500
#>  6     0      cyl     4 22.90000 2.935000
#>  7     0      cyl     6 19.12500 3.388750
#>  8     0      cyl     8 15.05000 4.104083
#>  9     0 hp_lt100 FALSE 16.06875 3.925250
#> 10     0 hp_lt100  TRUE 22.90000 2.935000

In fact, summarise_marginal() will work for both grouped and ungrouped data.frames so this function alone is suitable.

This is a useful solution, but given that group_by() has uses beyond summarise(), for example with nest() or do(), I think the idea of a group_by_marginal() (or group_by_tag() or whatever name is best) is worth pursuing.

Some remaining issues:

  • The function needs to convert integer, factor and logical columns to character so their values all sit nicely in the same values column. This is a slight violation of tidy data principles, though is no different to how gather() behaves.

  • Assuming a group_by_marginal() function is possible, it's output cannot be passed to mutate() without resolving the ambiguity of where to place the values from each group. From the example above, which value of meanmpg should be given to a row with cyl==4 and am==0? Both 26.66364 (from cyl==4) and 17.14737 (from am==0) are relevant. (Note there is no ambiguity for group_by() %>% mutate() since it will return the joint summary function for cyl==4 & am==0). Three possible options for group_by_marginal() %>% mutate():

    1. It should be forbidden.
    2. It should create multiple columns, eg, meanmpg_cyl and meanmpg_am.
    3. It should replicate rows for each group.
  • Speed. I'm certain my implementation of this concept is inefficient and could be improved.

Finally, to demonstrate on the original example problem:

bind_rows(
  esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
  esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
) %>%
summarise_marginal(
  vars(highage = agegp %in% c('65-74','75+'),
       highalc = alcgp %in% c('80-119','120+'),
       hightob = tobgp %in% c('20-29','30+'),
       highrisk = highage & highalc & hightob,
       all = 1),
  n=length(agegp),
  ncases=sum(case),
  case.rate=mean(case),
  .removeF=TRUE
)

#> # A tibble: 5 x 5
#>      group value     n ncases case.rate
#>      <chr> <chr> <int>  <dbl>     <dbl>
#> 1  highage  TRUE   273     68 0.2490842
#> 2  highalc  TRUE   301     96 0.3189369
#> 3  hightob  TRUE   278     64 0.2302158
#> 4 highrisk  TRUE    11      5 0.4545455
#> 5      all     1  1175    200 0.1702128
like image 38
wjchulme Avatar answered Nov 15 '22 10:11

wjchulme