Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

data.table alternative for dplyr case_when

Some time ago they introduced a nice SQL-like alternative to ifelse within dplyr, i.e. case_when.

Is there an equivalent in data.table that would allow you to specify different conditions within one [] statement, without loading additional packages?

Example:

library(dplyr)

df <- data.frame(a = c("a", "b", "a"), b = c("b", "a", "a"))

df <- df %>% mutate(
    new = case_when(
    a == "a" & b == "b" ~ "c",
    a == "b" & b == "a" ~ "d",
    TRUE ~ "e")
    )

  a b new
1 a b   c
2 b a   d
3 a a   e

It would certainly be very helpful and make code much more readable (one of the reasons why I keep using dplyr in these cases).

like image 899
arg0naut91 Avatar asked Oct 28 '18 11:10

arg0naut91


People also ask

Is data table better than dplyr?

Memory Usage (Efficiency)data. table is the most efficient when filtering rows. dplyr is far more efficient when summarizing by group while data. table was the least efficient.

Does dplyr work with data table?

Each dplyr verb must do some work to convert dplyr syntax to data. table syntax. This takes time proportional to the complexity of the input code, not the input data, so should be a negligible overhead for large datasets.

How much faster is data table than dplyr?

table ~6x faster. (Unverified) has data. table 75% faster on larger versions of a group/apply/sort while dplyr was 40% faster on the smaller ones (another SO question from comments, thanks danas).

Is Base R faster than dplyr?

In my benchmarking project, Base R sorts a dataset much faster than dplyr or data. table.


4 Answers

FYI, a more recent answer for those coming across this post 2019. data.table versions above 1.13.0 have the fcase function that can be used. Note that it is not a drop-in replacement for dplyr::case_when as the syntax is different, but will be a "native" data.table way of calculation.

# Lazy evaluation
x = 1:10
data.table::fcase(
    x < 5L, 1L,
    x >= 5L, 3L,
    x == 5L, stop("provided value is an unexpected one!")
)
# [1] 1 1 1 1 3 3 3 3 3 3

dplyr::case_when(
    x < 5L ~ 1L,
    x >= 5L ~ 3L,
    x == 5L ~ stop("provided value is an unexpected one!")
)
# Error in eval_tidy(pair$rhs, env = default_env) :
#  provided value is an unexpected one!

# Benchmark
x = sample(1:100, 3e7, replace = TRUE) # 114 MB
microbenchmark::microbenchmark(
dplyr::case_when(
  x < 10L ~ 0L,
  x < 20L ~ 10L,
  x < 30L ~ 20L,
  x < 40L ~ 30L,
  x < 50L ~ 40L,
  x < 60L ~ 50L,
  x > 60L ~ 60L
),
data.table::fcase(
  x < 10L, 0L,
  x < 20L, 10L,
  x < 30L, 20L,
  x < 40L, 30L,
  x < 50L, 40L,
  x < 60L, 50L,
  x > 60L, 60L
),
times = 5L,
unit = "s")
# Unit: seconds
#               expr   min    lq  mean   median    uq    max neval
# dplyr::case_when   11.57 11.71 12.22    11.82 12.00  14.02     5
# data.table::fcase   1.49  1.55  1.67     1.71  1.73   1.86     5

Source, data.table NEWS for 1.13.0, released (24 Jul 2020).

like image 97
skedaddle_waznook Avatar answered Oct 22 '22 23:10

skedaddle_waznook


1) If the conditions are mutually exclusive with a default if all conditions are false then this works:

library(data.table)
DT <- as.data.table(df) # df is from question

DT[, new := c("e", "c", "d")[1 +
                             1 * (a == "a" & b == "b") + 
                             2 * (a == "b" & b == "a")]
]

giving:

> DT
   a b new
1: a b   c
2: b a   d
3: a a   e

2) If the results of the conditions are numeric then it is even easier. For example suppose instead of c and d we want 10 and 17 with a default of 3. Then:

library(data.table)
DT <- as.data.table(df) # df is from question

DT[, new := 3 + 
            (10 - 3) * (a == "a" & b == "b") + 
            (17 - 3) * (a == "b" & b == "a")]

3) Note that adding a 1-liner is sufficient to implement this. It assumes that there is at least one TRUE leg for each row.

when <- function(...) names(match.call()[-1])[apply(cbind(...), 1, which.max)]

# test
DT[, new := when(c = a == 'a' & b == 'b', 
                 d = a == 'b' & b == 'a', 
                 e = TRUE)]
like image 28
G. Grothendieck Avatar answered Oct 22 '22 23:10

G. Grothendieck


This is not really an answer, but a bit too long for a comment. If deemed inappropriate I'm happy to remove the post.

There exists an interesting post on RStudio Community that discusses options to use dplyr::case_when without the usual tidyverse dependencies.

To summarise, three alternatives seem to exist:

  1. Stefan Fleck isolated case_when from dplyr and build a new package lest that depends only on base.
  2. yonicd developed noplyr, which "provides basic dplyr and tidyr functionality without the tidyverse dependencies".
  3. Bob Rudis (hrbrmstr) is the creator of freebase, a "A 'usethis'-like Package for Base R Pseudo-equivalents of 'tidyverse' Code", which might also be worth checking out.

If it is only case_when that you're after, I imagine lest might be an attractive & minimal option in combination with data.table.


Update [29 October 2019]

Tyson Barrett recently made the package tidyfast available (currently as version 0.1.0) on GitHub, which provides function "dt_case_when for dplyr::case_when() syntax with the speed of data.table::fifelse()".

Update [25 February 2020]

There is also dtplyr, authored by Lionel Henry and maintained by Hadley Wickham, which "provides a data.table backend for dplyr. The goal of dtplyr is to allow you to write dplyr code that is automatically translated to the equivalent, but usually much faster, data.table code.".

like image 33
Maurits Evers Avatar answered Oct 22 '22 23:10

Maurits Evers


Here is a variation on @g-grothendieck's answer that works for non exclusive conditions :

DT[, new := c("c", "d", "e")[
  apply(cbind(
    a == "a" & b == "b", 
    a == "b" & b == "a",
    TRUE), 1, which.max)]
  ]

DT
#    a b new
# 1: a b   c
# 2: b a   d
# 3: a a   e
like image 41
Moody_Mudskipper Avatar answered Oct 22 '22 23:10

Moody_Mudskipper