Some time ago they introduced a nice SQL-like alternative to ifelse within dplyr, i.e. case_when.
Is there an equivalent in data.table that would allow you to specify different conditions within one [] statement, without loading additional packages?
Example:
library(dplyr)
df <- data.frame(a = c("a", "b", "a"), b = c("b", "a", "a"))
df <- df %>% mutate(
new = case_when(
a == "a" & b == "b" ~ "c",
a == "b" & b == "a" ~ "d",
TRUE ~ "e")
)
a b new
1 a b c
2 b a d
3 a a e
It would certainly be very helpful and make code much more readable (one of the reasons why I keep using dplyr in these cases).
FYI, a more recent answer for those coming across this post after 2019: data.table versions 1.13.0 and later have the fcase function that can be used. Note that it is not a drop-in replacement for dplyr::case_when, as the syntax is different, but it is a "native" data.table way of doing the calculation.
# Lazy evaluation
x = 1:10
data.table::fcase(
x < 5L, 1L,
x >= 5L, 3L,
x == 5L, stop("provided value is an unexpected one!")
)
# [1] 1 1 1 1 3 3 3 3 3 3
dplyr::case_when(
x < 5L ~ 1L,
x >= 5L ~ 3L,
x == 5L ~ stop("provided value is an unexpected one!")
)
# Error in eval_tidy(pair$rhs, env = default_env) :
# provided value is an unexpected one!
# Benchmark
x = sample(1:100, 3e7, replace = TRUE) # 114 MB
microbenchmark::microbenchmark(
dplyr::case_when(
x < 10L ~ 0L,
x < 20L ~ 10L,
x < 30L ~ 20L,
x < 40L ~ 30L,
x < 50L ~ 40L,
x < 60L ~ 50L,
x > 60L ~ 60L
),
data.table::fcase(
x < 10L, 0L,
x < 20L, 10L,
x < 30L, 20L,
x < 40L, 30L,
x < 50L, 40L,
x < 60L, 50L,
x > 60L, 60L
),
times = 5L,
unit = "s")
# Unit: seconds
# expr min lq mean median uq max neval
# dplyr::case_when 11.57 11.71 12.22 11.82 12.00 14.02 5
# data.table::fcase 1.49 1.55 1.67 1.71 1.73 1.86 5
Source: data.table NEWS for 1.13.0 (released 24 Jul 2020).
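To tie this back to the question, here is a minimal sketch of fcase used inside a single [] call on the question's data. It assumes data.table 1.13.0 or later is installed; the default argument plays the role of the TRUE ~ "e" branch in case_when.

```r
library(data.table)

# Same data as in the question
DT <- data.table(a = c("a", "b", "a"), b = c("b", "a", "a"))

# Conditions and values alternate; `default` covers rows that match nothing
DT[, new := fcase(
  a == "a" & b == "b", "c",
  a == "b" & b == "a", "d",
  default = "e"
)]

print(DT)
```

All values, including the default, need to be of the same type (here character), otherwise fcase raises an error.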
1) If the conditions are mutually exclusive with a default if all conditions are false then this works:
library(data.table)
DT <- as.data.table(df) # df is from question
DT[, new := c("e", "c", "d")[1 +
1 * (a == "a" & b == "b") +
2 * (a == "b" & b == "a")]
]
giving:
> DT
a b new
1: a b c
2: b a d
3: a a e
2) If the results of the conditions are numeric then it is even easier. For example, suppose that instead of c and d we want 10 and 17, with a default of 3. Then:
library(data.table)
DT <- as.data.table(df) # df is from question
DT[, new := 3 +
(10 - 3) * (a == "a" & b == "b") +
(17 - 3) * (a == "b" & b == "a")]
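The arithmetic above relies on logical values being coerced to 0 and 1. A small base R sketch of the same calculation on plain vectors (the names cond1 and cond2 are only for illustration) may make the mechanics clearer:

```r
a <- c("a", "b", "a")
b <- c("b", "a", "a")

cond1 <- a == "a" & b == "b"   # TRUE FALSE FALSE
cond2 <- a == "b" & b == "a"   # FALSE TRUE FALSE

# Logicals are coerced to 0/1 in arithmetic, so each row gets the default
# plus the offset of whichever condition is TRUE
new <- 3 + (10 - 3) * cond1 + (17 - 3) * cond2
print(new)
# [1] 10 17  3
```

This only gives the intended values when the conditions are mutually exclusive: if two conditions were TRUE on the same row, both offsets would be added.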
3) Note that adding a 1-liner is sufficient to implement this. It assumes that there is at least one TRUE leg for each row.
when <- function(...) names(match.call()[-1])[apply(cbind(...), 1, which.max)]
# test
DT[, new := when(c = a == 'a' & b == 'b',
d = a == 'b' & b == 'a',
e = TRUE)]
This is not really an answer, but a bit too long for a comment. If deemed inappropriate I'm happy to remove the post.
There exists an interesting post on RStudio Community that discusses options to use dplyr::case_when without the usual tidyverse dependencies.
To summarise, three alternatives seem to exist:
1) case_when taken out of dplyr and built into a new package, lest, that depends only on base.
2) noplyr, which "provides basic dplyr and tidyr functionality without the tidyverse dependencies".
3) freebase, a "'usethis'-like Package for Base R Pseudo-equivalents of 'tidyverse' Code", which might also be worth checking out.
If it is only case_when that you're after, I imagine lest might be an attractive & minimal option in combination with data.table.
Tyson Barrett recently made the package tidyfast available (currently as version 0.1.0) on GitHub, which provides the function "dt_case_when for dplyr::case_when() syntax with the speed of data.table::fifelse()".
There is also dtplyr, authored by Lionel Henry and maintained by Hadley Wickham, which "provides a data.table backend for dplyr. The goal of dtplyr is to allow you to write dplyr code that is automatically translated to the equivalent, but usually much faster, data.table code."
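As a rough illustration of the dtplyr route, the sketch below runs the question's case_when through lazy_dt(). It assumes that data.table, dplyr and dtplyr are all installed; the generated data.table code depends on the dtplyr version, so no query output is shown here.

```r
library(data.table)
library(dplyr)
library(dtplyr)

df <- data.frame(a = c("a", "b", "a"), b = c("b", "a", "a"))

# lazy_dt() wraps the data so that dplyr verbs are recorded rather than run
lazy <- lazy_dt(df) %>%
  mutate(new = case_when(
    a == "a" & b == "b" ~ "c",
    a == "b" & b == "a" ~ "d",
    TRUE ~ "e"
  ))

# show_query() prints the data.table code that dtplyr generated
show_query(lazy)

# as.data.table() triggers the actual computation
res <- as.data.table(lazy)
print(res)
```

Note that this does load additional packages, so it does not satisfy the "no extra packages" constraint in the question.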
Here is a variation on @g-grothendieck's answer that works for non-exclusive conditions:
DT[, new := c("c", "d", "e")[
apply(cbind(
a == "a" & b == "b",
a == "b" & b == "a",
TRUE), 1, which.max)]
]
DT
# a b new
# 1: a b c
# 2: b a d
# 3: a a e
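To see why this handles overlapping conditions, note that which.max on a logical vector returns the position of the first TRUE, so the first matching condition wins, as in case_when. A base R sketch with deliberately overlapping conditions (the vector x and the labels are made up for illustration):

```r
x <- c(1, 5, 12)

# Most specific condition first; several columns can be TRUE for one row
conds <- cbind(
  x > 10,
  x > 3,
  x > 0
)

# which.max() on each logical row gives the index of the first TRUE
idx <- apply(conds, 1, which.max)
labels <- c("above 10", "above 3", "positive")[idx]
print(labels)
# [1] "positive" "above 3"  "above 10"

# Caveat: a row with no TRUE at all also yields index 1
which.max(c(FALSE, FALSE, FALSE))
# [1] 1
```

The caveat at the end is the reason for the trailing TRUE column in the answer above: without a catch-all, rows that match nothing would silently receive the first label.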