I would like to create a function in R, similar to dplyr
's group_by
function, that when combined with summarise
can give summary statistics for a dataset where group membership is not mutually exclusive. I.e., observations can belong to multiple groups. One way to think about it might be to consider tags; observations may belong to one or more tags which might overlap.
For example, take R's esoph
dataset (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/esoph.html) documenting a case-control study of esophageal cancer. Suppose I'm interested in the number and proportion of cancer cases overall and per 'tag', where the tags are: 65+ years old; 80+ gm/day alcohol; 20+ gm/day tobacco; and a 'high risk' group where the previous 3 criteria are met.
Let's transform the dataset to long format (one participant per row) and then add these tags (logical columns) to the dataset:
library('dplyr')
data(esoph)
esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
) %>%
mutate(highage=(agegp %in% c('65-74','75+')),
highalc=(alcgp %in% c('80-119','120+')),
hightob=(tobgp %in% c('20-29','30+')),
highrisk=(highage & highalc & hightob)
)
My usual approach is to create a dataset where each observation is duplicated for every tag it belongs to, and then summarise
this dataset:
esophdup = bind_rows(esophlong %>% filter(highage) %>% mutate(tag='age>=65'),
esophlong %>% filter(highalc) %>% mutate(tag='alc>=80'),
esophlong %>% filter(hightob) %>% mutate(tag='tob>=20'),
esophlong %>% filter(highrisk) %>% mutate(tag='high risk'),
esophlong %>% filter() %>% mutate(tag='all')
) %>%
mutate(tag=factor(tag, levels = unique(.$tag)))
summary = esophdup %>%
group_by(tag) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case))
This approach is inefficient for large datasets or for a large number of tags and I will often run out of memory to store it.
An alternative is to summarise
each tag separately and then bind these summary datasets afterwards, as follows:
summary.age = esophlong %>%
filter(highage) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
mutate(tag='age>=65')
summary.alc = esophlong %>%
filter(highalc) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
mutate(tag='alc>=80')
summary.tob = esophlong %>%
filter(hightob) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
mutate(tag='tob>=20')
summary.highrisk = esophlong %>%
filter(highrisk) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
mutate(tag='high risk')
summary.all = esophlong %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
mutate(tag='all')
summary=bind_rows(summary.age,summary.alc,summary.tob,summary.highrisk,summary.all)
This approach is time-consuming and tedious when I have a large number of tags or I want to reuse the tags often for different summary measures throughout a project.
The function I have in mind, say group_by_tags(data, key, ...)
, which includes an argument to specify the name of the grouping column, should work something like this:
summary = esophlong %>%
group_by_tags(key='tags',
'age>=65'=highage,
'alc>=80'=highalc,
'tob>=20'=hightob,
'high risk'=highrisk,
'all ages'=1
) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case))
with the summary dataset looking like this:
> summary
tags n ncases case.rate
1 age>=65 273 68 0.2490842
2 alc>=80 301 96 0.3189369
3 tob>=20 278 64 0.2302158
4 high risk 11 5 0.4545455
5 all 1175 200 0.1702128
Even better, it could take variables of type "factor" as well as "logical" so that it could summarise, say, each age group individually, the 65+ year olds, and everybody:
summaryage = esophlong %>%
group_by_tags(key='Age.group',
agegp,
'65+'=(agegp %in% c('65-74','75+')),
'all'=1
) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case))
>summaryage
Age.group n ncases case.rate
1 25-34 117 1 0.0085470
2 35-44 208 9 0.0432692
3 45-54 259 46 0.1776062
4 55-64 318 76 0.2389937
5 65-74 216 55 0.2546296
6 75+ 57 13 0.2280702
7 65+ 273 68 0.2490842
8 all 1175 200 0.1702128
Perhaps it's not possible with ...
and instead you might need to pass a vector/list of column names for the tags.
Any ideas?
EDIT: to be clear, the solution should take tag/group definitions and the required summary statistics as arguments, rather than being built into the function itself. Either as a two-step data %>% group_by_tags(tags) %>% summarise_tags(stats)
or a one-step data %>% summary_tags(tags,stats)
process.
This is a variation on @eddi's answer. I am taking the definitions of highage
et al as part of the function's job:
library(data.table)
custom_summary = function(DT, tags, stats){
setDT(DT)
rows = stack(lapply(tags[-1], function(x) DT[eval(x), which=TRUE]))
DT[rows$values, eval(stats), by=.(tag = rows$ind)]
}
And some example usage:
data(esoph)
library(dplyr)
esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
)
custom_summary(
DT = esophlong,
tags = quote(list(
'age>=65' = agegp %in% c('65-74','75+'),
'alc>=80' = alcgp %in% c('80-119','120+'),
'tob>=20' = tobgp %in% c('20-29','30+'),
'high risk' = eval(substitute(`age>=65` & `alc>=80` & `tob>=20`, as.list(tags))),
'all ages' = TRUE
)),
stats = quote(list(
n = .N,
n_cases = sum(case),
case.rate = mean(case)
))
)
tag n n_cases case.rate
1: age>=65 273 68 0.2490842
2: alc>=80 301 96 0.3189369
3: tob>=20 278 64 0.2302158
4: high risk 11 5 0.4545455
5: all ages 1175 200 0.1702128
The technique of using eval
inside DT[...]
is explained in the data.table FAQ.
Not a completely functional answer, more "WIP" or start for discourse. This should ultimately go in a repo and either an additional package or a PR for dplyr
.
One way is to mimic the attributes' structure from a "normally" grouped variable:
library(dplyr)
esoph %>% group_by(agegp, alcgp) %>% attributes %>% str
# List of 9
# $ names : chr [1:5] "agegp" "alcgp" "tobgp" "ncases" ...
# $ row.names : int [1:88] 1 2 3 4 5 6 7 8 9 10 ...
# $ class : chr [1:4] "grouped_df" "tbl_df" "tbl" "data.frame"
# $ vars :List of 2
# ..$ : symbol agegp
# ..$ : symbol alcgp
# $ drop : logi TRUE
# $ indices :List of 24
# ..$ : int [1:4] 0 1 2 3
# ..$ : int [1:4] 4 5 6 7
# ..$ : int [1:3] 8 9 10
# ...........
# $ group_sizes : int [1:24] 4 4 3 4 4 4 4 3 4 4 ...
# $ biggest_group_size: int 4
# $ labels :'data.frame': 24 obs. of 2 variables:
# ..$ agegp: Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 2 2 2 2 3 3 ...
# ..$ alcgp: Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 2 3 4 1 2 3 4 1 2 ...
# ..- attr(*, "vars")=List of 2
# .. ..$ : symbol agegp
# .. ..$ : symbol alcgp
# ..- attr(*, "drop")= logi TRUE
We can reproduce this artificially to see if/how it might work:
esoph2 <- esoph
syms <- list(as.symbol("agegp65"), as.symbol("alcgp80"))
attr(esoph2, "vars") <- syms
attr(esoph2, "drop") <- TRUE
# 'agegp' and 'aclgp' are ordered factors, for simplicity here just using ints
# `group_by` indices are 0-based
indices <- list(
which(as.integer(esoph2$agegp) >= 5) - 1,
which(as.integer(esoph2$alcgp) >= 3) - 1
)
attr(esoph2, "indices") <- indices
attr(esoph2, "group_sizes") <- lengths(indices)
attr(esoph2, "biggest_group_size") <- max(lengths(indices))
df <- data.frame(agegp65 = "agegp >= 65", alcgp80 = "alcgp >= 80", stringsAsFactors = FALSE)
attr(df, "vars") <- syms
attr(esoph2, "labels") <- df
class(esoph2) <- c("grouped_df", "tbl_df", "tbl", "data.frame")
Which "looks" like a normal grouped data.frame:
str(esoph2)
# Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 88 obs. of 5 variables:
# $ agegp : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
# $ alcgp : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
# $ tobgp : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
# $ ncases : num 0 0 0 0 0 0 0 0 0 0 ...
# $ ncontrols: num 40 10 6 5 27 7 4 7 2 1 ...
# - attr(*, "vars")=List of 2
# ..$ : symbol agegp65
# ..$ : symbol alcgp80
# - attr(*, "drop")= logi TRUE
# - attr(*, "indices")=List of 2
# ..$ : num 62 63 64 65 66 67 68 69 70 71 ...
# ..$ : num 8 9 10 11 12 13 14 23 24 25 ...
# - attr(*, "group_sizes")= int 26 42
# - attr(*, "biggest_group_size")= int 42
# - attr(*, "labels")='data.frame': 1 obs. of 2 variables:
# ..$ agegp65: chr "agegp >= 65"
# ..$ alcgp80: chr "alcgp >= 80"
# ..- attr(*, "vars")=List of 2
# .. ..$ : symbol agegp65
# .. ..$ : symbol alcgp80
esoph2
# Source: local data frame [88 x 5]
# Groups: agegp65, alcgp80 [2]
# agegp alcgp tobgp ncases ncontrols
# <ord> <ord> <ord> <dbl> <dbl>
# 1 25-34 0-39g/day 0-9g/day 0 40
# 2 25-34 0-39g/day 10-19 0 10
# 3 25-34 0-39g/day 20-29 0 6
# 4 25-34 0-39g/day 30+ 0 5
# 5 25-34 40-79 0-9g/day 0 27
# 6 25-34 40-79 10-19 0 7
# 7 25-34 40-79 20-29 0 4
# 8 25-34 40-79 30+ 0 7
# 9 25-34 80-119 0-9g/day 0 2
# 10 25-34 80-119 10-19 0 1
# # ... with 78 more rows
Unfortunately:
esoph2 %>% summarize(n = n())
# Error: corrupt 'grouped_df', contains 88 rows, and 68 rows in groups
Ergo my comment that summarize
assumes full coverage; you'd have to modify dplyr_summarise_impl
(in C++), perhaps making a third option to summarise_grouped
and summarise_not_grouped
.
library(data.table)
setDT(esophlong)
special.summary = function(dt, vars) {
rbindlist(lapply(seq_along(vars), function(i) {
var = vars[[i]]
if (is.logical(dt[, eval(var)])) {
dt[eval(var) == TRUE, .(.N, sum(case), mean(case))][, tag := names(vars)[i]][
, .SD, by = tag] # last step is a lazy version of setcolorder
} else {
dt[, .(.N, sum(case), mean(case)), by = .(tag = eval(var))]
}
}))
}
special.summary(esophlong, list('age>=65'=quote(highage),
'alc>=80'=quote(highalc),
'tob>=20'=quote(hightob),
'high risk'=quote(highrisk),
'all'=quote(TRUE)))
# tag N V2 V3
#1: age>=65 273 68 0.2490842
#2: alc>=80 301 96 0.3189369
#3: tob>=20 278 64 0.2302158
#4: high risk 11 5 0.4545455
#5: all 1175 200 0.1702128
special.summary(esophlong, list(quote(agegp),
'65+'=quote(agegp %in% c('65-74','75+')),
'all'=quote(TRUE)))
# tag N V2 V3
#1: 25-34 117 1 0.008547009
#2: 35-44 208 9 0.043269231
#3: 45-54 259 46 0.177606178
#4: 55-64 318 76 0.238993711
#5: 65-74 216 55 0.254629630
#6: 75+ 57 13 0.228070175
#7: 65+ 273 68 0.249084249
#8: all 1175 200 0.170212766
This can be made more customizable of course, and that's left as an exercise to the reader.
In the absence of any knowledge of tidyverse internals, I avoided trying to create group_by()
-type function whose output should be passed to summarise()
and instead made a single function that combines both (similar to other answers but, I hope, more user-friendly and generalisable).
Since group_by() %>% summarise()
returns the joint summary information for each nested combination of grouping variables, I chose the name summarise_marginal()
since it will return the marginal summary information for each grouping variable independently.
grouped_df
objectsFirst, a solution that doesn't work with grouped_df
classes, but is extended below:
summarise_marginal0 <- function(.tbl, .vars, ..., .removeF=FALSE){
dots <- quos(...)
.tbl %>%
transmute(!!! .vars) %>%
map_dfr(
~ summarise(group_by(.tbl, 'value'=., add = TRUE), !!! dots) %>% # piping .tbl %>% group_by() %>% summarise() evaluates in the wrong order for some reason
filter_at(vars('value'), all_vars(!(.==FALSE & .removeF))) %>% # to remove rows where a logical group is FALSE.
mutate_at(vars('value'), as.character) # standardises 'value' column in case map_dfr tries to convert logical to factor
, .id='group'
)
}
mtcars %>%
summarise_marginal0(
vars(cyl, am),
meanmpg = mean(mpg),
meanwt = mean(wt)
)
#> # A tibble: 5 x 4
#> group value meanmpg meanwt
#> <chr> <chr> <dbl> <dbl>
#> 1 cyl 4 26.66364 2.285727
#> 2 cyl 6 19.74286 3.117143
#> 3 cyl 8 15.10000 3.999214
#> 4 am 0 17.14737 3.768895
#> 5 am 1 24.39231 2.411000
Capturing groups with vars()
(like with summarise_at()
or mutate_at()
) neatly separates the groups from the summary functions, and allows new groups to be created on-the-fly:
mtcars %>%
summarise_marginal0(
vars(cyl, hp_lt100 = hp<100),
meanmpg = mean(mpg),
meanwt = mean(wt)
)
#> # A tibble: 5 x 4
#> group value meanmpg meanwt
#> <chr> <chr> <dbl> <dbl>
#> 1 cyl 4 26.66364 2.285727
#> 2 cyl 6 19.74286 3.117143
#> 3 cyl 8 15.10000 3.999214
#> 4 hp_lt100 FALSE 17.45217 3.569652
#> 5 hp_lt100 TRUE 26.83333 2.316667
We can use the .removeF
argument to remove FALSE
logical values. Useful if you want to summarise certain rows but not their compliment:
mtcars %>%
summarise_marginal0(
vars(cyl==6, hp_lt100 = hp<100, hp_lt200 = hp<200),
meanmpg = mean(mpg),
meanwt = mean(wt),
.removeF = TRUE
)
#> # A tibble: 3 x 4
#> group value meanmpg meanwt
#> <chr> <chr> <dbl> <dbl>
#> 1 cyl == 6 TRUE 19.74286 3.117143
#> 2 hp_lt100 TRUE 26.83333 2.316667
#> 3 hp_lt200 TRUE 21.96000 2.911320
Notice the how even without explicitly naming the cyl == 6
group we still get a useful name for it.
grouped_df
objectssummarise_marginal0()
can be extended to work with grouped_df
objects returned by group_by()
:
summarise_marginal <- function(.tbl, .vars, ...){
dots <- quos(...)
.tbl %>%
nest() %>%
mutate(
summarised = map(data, ~summarise_marginal0(., .vars, !!! dots))
) %>%
unnest(summarised) %>%
purrrlyr::slice_rows(group_vars(.tbl))
}
mtcars %>%
group_by(am) %>%
summarise_marginal(
vars(cyl, hp_lt100 = hp<100),
meanmpg = mean(mpg),
meanwt = mean(wt)
)
#> # A tibble: 10 x 5
#> # Groups: am [2]
#> am group value meanmpg meanwt
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 cyl 4 28.07500 2.042250
#> 2 1 cyl 6 20.56667 2.755000
#> 3 1 cyl 8 15.40000 3.370000
#> 4 1 hp_lt100 FALSE 20.61429 2.756857
#> 5 1 hp_lt100 TRUE 28.80000 2.007500
#> 6 0 cyl 4 22.90000 2.935000
#> 7 0 cyl 6 19.12500 3.388750
#> 8 0 cyl 8 15.05000 4.104083
#> 9 0 hp_lt100 FALSE 16.06875 3.925250
#> 10 0 hp_lt100 TRUE 22.90000 2.935000
In fact, summarise_marginal()
will work for both grouped and ungrouped data.frame
s so this function alone is suitable.
This is a useful solution, but given that group_by()
has uses beyond summarise()
, for example with nest()
or do()
, I think the idea of a group_by_marginal()
(or group_by_tag()
or whatever name is best) is worth pursuing.
Some remaining issues:
The function needs to convert integer, factor and logical columns to character so their values all sit nicely in the same values
column. This is a slight violation of tidy data principles, though is no different to how gather()
behaves.
Assuming a group_by_marginal()
function is possible, it's output cannot be passed to mutate()
without resolving the ambiguity of where to place the values from each group. From the example above, which value of meanmpg
should be given to a row with cyl==4
and am==0
? Both 26.66364
(from cyl==4
) and 17.14737
(from am==0
) are relevant. (Note there is no ambiguity for group_by() %>% mutate()
since it will return the joint summary function for cyl==4 & am==0
). Three possible options for group_by_marginal() %>% mutate()
:
meanmpg_cyl
and meanmpg_am
.Finally, to demonstrate on the original example problem:
bind_rows(
esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
) %>%
summarise_marginal(
vars(highage = agegp %in% c('65-74','75+'),
highalc = alcgp %in% c('80-119','120+'),
hightob = tobgp %in% c('20-29','30+'),
highrisk = highage & highalc & hightob,
all = 1),
n=length(agegp),
ncases=sum(case),
case.rate=mean(case),
.removeF=TRUE
)
#> # A tibble: 5 x 5
#> group value n ncases case.rate
#> <chr> <chr> <int> <dbl> <dbl>
#> 1 highage TRUE 273 68 0.2490842
#> 2 highalc TRUE 301 96 0.3189369
#> 3 hightob TRUE 278 64 0.2302158
#> 4 highrisk TRUE 11 5 0.4545455
#> 5 all 1 1175 200 0.1702128
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With