Some time ago, dplyr introduced case_when, a nice SQL-like alternative to ifelse.
Is there an equivalent in data.table that would allow you to specify different conditions within one [] statement, without loading additional packages?
Example:
library(dplyr)
df <- data.frame(a = c("a", "b", "a"), b = c("b", "a", "a"))
df <- df %>% mutate(
  new = case_when(
    a == "a" & b == "b" ~ "c",
    a == "b" & b == "a" ~ "d",
    TRUE ~ "e")
)
  a b new
1 a b   c
2 b a   d
3 a a   e
It would certainly be very helpful and make code much more readable (one of the reasons why I keep using dplyr in these cases).
FYI, a more recent answer for those coming across this post after 2019: data.table versions 1.13.0 and later provide the fcase function. Note that it is not a drop-in replacement for dplyr::case_when, as the syntax is different, but it is a "native" data.table way of doing the calculation.
# Lazy evaluation
x = 1:10
data.table::fcase(
  x < 5L, 1L,
  x >= 5L, 3L,
  x == 5L, stop("provided value is an unexpected one!")
)
# [1] 1 1 1 1 3 3 3 3 3 3
dplyr::case_when(
  x < 5L ~ 1L,
  x >= 5L ~ 3L,
  x == 5L ~ stop("provided value is an unexpected one!")
)
# Error in eval_tidy(pair$rhs, env = default_env) :
# provided value is an unexpected one!
# Benchmark
x = sample(1:100, 3e7, replace = TRUE) # 114 MB
microbenchmark::microbenchmark(
  dplyr::case_when(
    x < 10L ~ 0L,
    x < 20L ~ 10L,
    x < 30L ~ 20L,
    x < 40L ~ 30L,
    x < 50L ~ 40L,
    x < 60L ~ 50L,
    x > 60L ~ 60L
  ),
  data.table::fcase(
    x < 10L, 0L,
    x < 20L, 10L,
    x < 30L, 20L,
    x < 40L, 30L,
    x < 50L, 40L,
    x < 60L, 50L,
    x > 60L, 60L
  ),
  times = 5L,
  unit = "s")
# Unit: seconds
#               expr   min    lq  mean median    uq   max neval
#   dplyr::case_when 11.57 11.71 12.22  11.82 12.00 14.02     5
#  data.table::fcase  1.49  1.55  1.67   1.71  1.73  1.86     5
Source: data.table NEWS for 1.13.0 (released 24 Jul 2020).
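For the data in the question, the same idea can be sketched as follows. This is a minimal example, assuming data.table 1.13.0 or later is installed; it uses fcase's default argument in place of the TRUE ~ "e" leg.

```r
library(data.table)

# Same data as in the question
DT <- data.table(a = c("a", "b", "a"), b = c("b", "a", "a"))

# Condition/value pairs are plain comma-separated arguments (no formulas);
# rows matching no condition receive the value given in `default`
DT[, new := fcase(
  a == "a" & b == "b", "c",
  a == "b" & b == "a", "d",
  default = "e"
)]

DT$new
# [1] "c" "d" "e"
```

Everything happens inside one [] call, and no package other than data.table is loaded.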
1) If the conditions are mutually exclusive, with a default that applies when all conditions are false, then this works:
library(data.table)
DT <- as.data.table(df) # df is from question
DT[, new := c("e", "c", "d")[1 +
  1 * (a == "a" & b == "b") +
  2 * (a == "b" & b == "a")]
]
giving:
> DT
   a b new
1: a b   c
2: b a   d
3: a a   e
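To see why the lookup in 1) works, it may help to compute the index on its own. The sketch below uses plain vectors rather than a data.table (the names idx and labels are illustrative only): each logical condition is coerced to 0 or 1, so a row that matches nothing gets index 1, the first condition gives 2, and the second gives 3.

```r
a <- c("a", "b", "a")
b <- c("b", "a", "a")

# TRUE/FALSE are coerced to 1/0 by the multiplication
idx <- 1 +
  1 * (a == "a" & b == "b") +
  2 * (a == "b" & b == "a")
idx
# [1] 2 3 1

# Position 1 holds the default, positions 2 and 3 the two results
labels <- c("e", "c", "d")
labels[idx]
# [1] "c" "d" "e"
```

Because the weights are added together, this only holds when at most one condition is TRUE per row.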
2) If the results of the conditions are numeric then it is even easier. For example suppose instead of c and d we want 10 and 17 with a default of 3. Then:
library(data.table)
DT <- as.data.table(df) # df is from question
DT[, new := 3 +
  (10 - 3) * (a == "a" & b == "b") +
  (17 - 3) * (a == "b" & b == "a")]
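As a quick check of the arithmetic in 2), here is the same expression on plain vectors (a sketch that does not need data.table). Each term adds the difference between a result and the default, so exactly one adjustment, or none, is applied per row.

```r
a <- c("a", "b", "a")
b <- c("b", "a", "a")

# Start from the default (3) and add (result - default) where a condition holds
new <- 3 +
  (10 - 3) * (a == "a" & b == "b") +
  (17 - 3) * (a == "b" & b == "a")
new
# [1] 10 17  3
```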
3) Note that a one-liner is sufficient to implement this. It assumes that at least one leg is TRUE for each row.
when <- function(...) names(match.call()[-1])[apply(cbind(...), 1, which.max)]
# test
DT[, new := when(c = a == 'a' & b == 'b',
                 d = a == 'b' & b == 'a',
                 e = TRUE)]
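The following sketch runs the same helper on plain vectors to show what it returns, and what happens when the "at least one TRUE leg" assumption is violated. The variable names with_default and no_default are illustrative only.

```r
when <- function(...) names(match.call()[-1])[apply(cbind(...), 1, which.max)]

a <- c("a", "b", "a")
b <- c("b", "a", "a")

# With a catch-all leg, every row has at least one TRUE
with_default <- when(c = a == "a" & b == "b",
                     d = a == "b" & b == "a",
                     e = TRUE)
with_default
# [1] "c" "d" "e"

# Without it, row 3 has no TRUE leg; which.max() on an all-FALSE row
# returns 1, so that row silently receives the first label
no_default <- when(c = a == "a" & b == "b",
                   d = a == "b" & b == "a")
no_default
# [1] "c" "d" "c"
```

Ending the call with a leg such as e = TRUE is therefore the safe pattern.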
This is not really an answer, but a bit too long for a comment. If deemed inappropriate I'm happy to remove the post.
There exists an interesting post on RStudio Community that discusses options to use dplyr::case_when without the usual tidyverse dependencies.
To summarise, three alternatives seem to exist:
1) Isolate case_when from dplyr and build a new package, lest, that depends only on base.
2) noplyr, which "provides basic dplyr and tidyr functionality without the tidyverse dependencies".
3) freebase, "A 'usethis'-like Package for Base R Pseudo-equivalents of 'tidyverse' Code", which might also be worth checking out.
If it is only case_when that you're after, I imagine lest might be an attractive & minimal option in combination with data.table.
Tyson Barrett recently made the package tidyfast available (currently as version 0.1.0) on GitHub, which provides the function "dt_case_when for dplyr::case_when() syntax with the speed of data.table::fifelse()".
There is also dtplyr, authored by Lionel Henry and maintained by Hadley Wickham, which "provides a data.table backend for dplyr. The goal of dtplyr is to allow you to write dplyr code that is automatically translated to the equivalent, but usually much faster, data.table code."
Here is a variation on @g-grothendieck's answer that works for non-exclusive conditions:
DT[, new := c("c", "d", "e")[
  apply(cbind(
    a == "a" & b == "b",
    a == "b" & b == "a",
    TRUE), 1, which.max)]
]
DT
#    a b new
# 1: a b   c
# 2: b a   d
# 3: a a   e
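The question's data does not actually contain overlapping conditions, so here is a small sketch with made-up values (x, the thresholds and the labels are all illustrative) where they do overlap. which.max picks the first TRUE in each row, which gives the same first-match-wins behaviour as case_when, whereas the additive index from the mutually exclusive approach runs past the end of the lookup vector.

```r
x <- c(3, 7, 12)

# Overlapping conditions: x = 12 satisfies both x > 10 and x > 5
conds <- cbind(x > 10, x > 5, TRUE)
first_match <- c("high", "mid", "low")[apply(conds, 1, which.max)]
first_match
# [1] "low"  "mid"  "high"

# Additive index: for x = 12 the index is 1 + 1 + 2 = 4, which is out of range
additive <- c("low", "high", "mid")[1 + 1 * (x > 10) + 2 * (x > 5)]
additive
# [1] "low" "mid" NA
```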