Is there a way to dynamically/programmatically generate case_when
conditions in dplyr
with different column names and/or different numbers of conditions? I have an interactive script that I'm trying to convert into a function. There's a lot of repeated code in the case_when
statements and I'm wondering if it can be automated somehow without my needing to write everything from scratch again and again.
Here's a dummy dataset:
test_df = tibble(low_A=c(5, 15, NA),
low_TOT=c(NA, 10, NA),
low_B=c(20, 25, 30),
high_A=c(NA, NA, 10),
high_TOT=c(NA, 40, NA),
high_B=c(60, 20, NA))
expected_df = tibble(low_A=c(5, 15, NA),
low_TOT=c(NA, 10, NA),
low_B=c(20, 25, 30),
ans_low=c(5, 10, 30),
high_A=c(NA, NA, 10),
high_TOT=c(NA, 40, NA),
high_B=c(60, 20, NA),
ans_high=c(60, 40, 10))
> expected_df
# A tibble: 3 x 8
low_A low_TOT low_B ans_low high_A high_TOT high_B ans_high
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 NA 20 5 NA NA 60 60
2 15 10 25 10 NA 40 20 40
3 NA NA 30 30 10 NA NA 10
The logic I want is that if the ._TOT
column has a value, use that. If not, then try column ._A
, and if not, then column ._B
. Note that I intentionally didn't put ._TOT
as the first column for a group. I could just use coalesce() in that case, but I want a general solution irrespective of column order.
Of course, all of this is easy to do with a couple of case_when
statements. My issues are that:
_TOT, _A, _B
but with different prefixes (e.g., low_TOT, low_A, low_B, high_TOT, high_A, high_B,.....
and I don't want to rewrite a bunch of case_when
functions again and again.What I have right now looks like this (where I'm writing a case_when
for each prefix):
def my_function = function(df) {
df %>% mutate(
# If a total low doesn't exist, use A (if exists) or B (if exists)
"ans_low" := case_when(
!is.na(.data[["low_TOT"]]) ~ .data[["low_TOT"]],
!is.na(.data[["low_A"]]) ~ .data[["low_A"]],
!is.na(.data[["low_B"]]) ~ .data[["low_B"]],
),
# If a total high doesn't exist, use A (if exists) or B (if exists)
"ans_high" := case_when(
!is.na(.data[["high_TOT"]]) ~ .data[["high_TOT"]],
!is.na(.data[["high_A"]]) ~ .data[["high_R"]],
!is.na(.data[["high_B"]]) ~ .data[["high_B"]],
# Plus a whole bunch of similar case_when functions...
}
And what I'd like is to ideally get a way to dynamically generate case_when
functions with different conditions so that I'm not writing a new case_when
each time by exploiting the fact that:
high_
, low_
, etc.).!is.na( .data[[ . ]]) ~ .data[[ . ]]
, where the dot(.
) is the dynamically generated name of the column.What I'd like is something like:
def my_function = function(df) {
df %>% mutate(
"ans_low" := some_func(prefix="Low"),
"ans_high" := some_func(prefix="High")
}
I tried creating my own case_when
generator to replace the standard case_when
as shown below, but I'm getting an error. I'm guessing that's because .data
doesn't really work outside of the tidyverse functions?
some_func = function(prefix) {
case_when(
!is.na(.data[[ sprintf("%s_TOT", prefix) ]]) ~ .data[[ sprintf("%s_TOT", prefix) ]],
!is.na(.data[[ sprintf("%s_A", prefix) ]]) ~ .data[[ sprintf("%s_A", prefix) ]],
!is.na(.data[[ sprintf("%s_B", prefix) ]]) ~ .data[[ sprintf("%s_B", prefix) ]]
)
}
Something else I'm curious about is making an even more general case_when
generator. In the examples thus far, it's only the names (prefix) of the columns that are changing. What if I wanted to
high_W, high_X, high_Y, high_Z, low_W, low_X, low_Y, low_Z, .......
) and so make a character vector of suffixes an argument of some_func
!is.na(.data[[ . ]]) ~ .data[[ . ]]
for all the conditions, but what if I wanted to make this an argument of some_func
? For example, !is.na(.data[[ . ]]) ~ sprintf("%s is missing", .)
I'd be happy with just getting it to work with different prefixes but it'd be very cool to understand how I could achieve something even more general with arbitrary (but common) suffixes and arbitrary formulae such that I can do some_func(prefix, suffixes, formula)
.
Here is a custom case_when
function that you can call with purrr::reduce
and a vector of strings parts of your variable names (in the example c("low", "high")
:
library(dplyr)
library(purrr)
my_case_when <- function(df, x) {
mutate(df,
"ans_{x}" := case_when(
!is.na(!! sym(paste0(x, "_TOT"))) ~ !! sym(paste0(x, "_TOT")),
!is.na(!! sym(paste0(x, "_A"))) ~ !! sym(paste0(x, "_A")),
!is.na(!! sym(paste0(x, "_B"))) ~ !! sym(paste0(x, "_B"))
)
)
}
test_df %>%
reduce(c("low", "high"), my_case_when, .init = .)
#> # A tibble: 3 x 8
#> low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 NA 20 NA NA 60 5 60
#> 2 15 10 25 NA 40 20 10 40
#> 3 NA NA 30 10 NA NA 30 10
Created on 2021-07-22 by the reprex package (v0.3.0)
I also have a package on Github {dplyover} which is made for this kind of cases. For your example with more than two variables I would use dplyover::over
together with a special syntax to evaluate strings as variable names. We can further use dplyover::cut_names("_TOT")
to extract the string parts of the variable names that come before or after "_TOT"
(in the example this is "low"
and "high"
).
We can either use case_when
:
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
test_df %>%
mutate(over(cut_names("_TOT"),
list(ans = ~ case_when(
!is.na(.("{.x}_TOT")) ~ .("{.x}_TOT"),
!is.na(.("{.x}_A")) ~ .("{.x}_A"),
!is.na(.("{.x}_B")) ~ .("{.x}_B")
)),
.names = "{fn}_{x}")
)
#> # A tibble: 3 x 8
#> low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 NA 20 NA NA 60 5 60
#> 2 15 10 25 NA 40 20 10 40
#> 3 NA NA 30 10 NA NA 30 10
Or somewhat easier coalesce
:
test_df %>%
mutate(over(cut_names("_TOT"),
list(ans = ~ coalesce(.("{.x}_TOT"),
.("{.x}_A"),
.("{.x}_B"))),
.names = "{fn}_{x}")
)
#> # A tibble: 3 x 8
#> low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 NA 20 NA NA 60 5 60
#> 2 15 10 25 NA 40 20 10 40
#> 3 NA NA 30 10 NA NA 30 10
Created on 2021-07-22 by the reprex package (v0.3.0)
At the risk of not answering the question, I think the easiest way to approach this is to just reshape and use coalesce()
. Your data structure requires two pivots either way (I think) but this requires no careful thinking about what prefixes are present.
library(tidyverse)
test_df <- tibble(
low_A = c(5, 15, NA),
low_TOT = c(NA, 10, NA),
low_B = c(20, 25, 30),
high_A = c(NA, NA, 10),
high_TOT = c(NA, 40, NA),
high_B = c(60, 20, NA)
)
test_df %>%
rowid_to_column() %>%
pivot_longer(cols = -rowid, names_to = c("prefix", "suffix"), names_sep = "_") %>%
pivot_wider(names_from = suffix, values_from = value) %>%
mutate(ans = coalesce(TOT, A, B)) %>%
pivot_longer(cols = c(-rowid, -prefix), names_to = "suffix") %>%
pivot_wider(names_from = c(prefix, suffix), names_sep = "_", values_from = value)
#> # A tibble: 3 x 9
#> rowid low_A low_TOT low_B low_ans high_A high_TOT high_B high_ans
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 5 NA 20 5 NA NA 60 60
#> 2 2 15 10 25 10 NA 40 20 40
#> 3 3 NA NA 30 30 10 NA NA 10
Note also that case_when
has no tidy evaluation, and so just not using mutate
simplifies your some_func
a lot. You already got an answer using !!sym
inside mutate, so here is a version that illustrates a simpler way. I prefer not to use tidyeval unless necessary because I want to use a mutate
chain, and here it's not really needed.
some_func <- function(df, prefix) {
ans <- str_c(prefix, "_ans")
TOT <- df[[str_c(prefix, "_TOT")]]
A <- df[[str_c(prefix, "_A")]]
B <- df[[str_c(prefix, "_B")]]
df[[ans]] <- case_when(
!is.na(TOT) ~ TOT,
!is.na(A) ~ A,
!is.na(B) ~ B
)
df
}
reduce(c("low", "high"), some_func, .init = test_df)
#> # A tibble: 3 x 8
#> low_A low_TOT low_B high_A high_TOT high_B low_ans high_ans
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 NA 20 NA NA 60 5 60
#> 2 15 10 25 NA 40 20 10 40
#> 3 NA NA 30 10 NA NA 30 10
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With