Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

subsetting a data.table based on a named list

I'm trying to subset a given data.table

DT <- data.table(
  a = c(1:20),
  b = (3:4),
  c = (5:14),
  d = c(1:4)
)

within a function by a parameter which is a named list

param <- list(a = 1:10,
              b = 2:3,
              c = c(5, 7, 10))

I am maybe a bit stuck here but I certainly do not want implement something ugly like this. Especially since its not very dynamic.

DT[(if (!is.null(param$a))
  a %in% param$a
  else
    TRUE)
  &
    (if (!is.null(param$b))
      b %in% param$b
     else
       TRUE)
  &
    (if (!is.null(param$c))
      c %in%  param$c
     else
       TRUE)
  &
    (if (!is.null(param$d))
      d %in% param$d
     else
       TRUE)]
   a b c d
1: 1 3 5 1
2: 3 3 7 3

Any ideas how to achieve this in an elegant way in data.table or base R using the names of the named list to subset the corresponding columns in the data.table with the associate values? Thanks!

EDIT

I performed a microbenchmark with some of the answers:

func_4 <- function(myp, DT) {
  myp    = Filter(Negate(is.null), param)

  exs = Map(function(var, val)
    call("%in%", var, val),
    var = sapply(names(myp), as.name),
    val = myp)
  exi = Reduce(function(x, y)
    call("&", x, y), exs)
  ex = call("[", x = as.name("DT"), i = exi)
  # eval(as.call(c(as.list(ex))))
  eval(ex)
}

microbenchmark(
  (DT[do.call(pmin, Map(`%in%`, DT[, names(param), with = FALSE], param)) == 1L]),
  (DT[rowSums(mapply(`%in%`, DT[, names(param), with = FALSE], param)) == length(param)]),
  (DT[do.call(CJ, param), on = names(param), nomatch = NULL]),
  (DT[expand.grid(param), on = names(param), nomatch = NULL]),
  (DT[DT[, all(mapply(`%in%`, .SD, param)), by = 1:nrow(DT), .SDcols = names(param)]$V1]),
  (func_4(myp = param, DT = DT)),
  times = 200)

   min        lq      mean   median        uq       max neval
  446.656  488.5365  565.5597  511.403  533.7785  7167.847   200
  454.120  516.3000  566.8617  538.146  561.8965  1840.982   200
 2433.450 2538.6075 2732.4749 2606.986 2704.5285 10302.085   200
 2478.595 2588.7240 2939.8625 2642.311 2743.9375 10722.578   200
 2648.707 2761.2475 3040.4926 2814.177 2903.8845 10334.822   200
 3243.040 3384.6220 3764.5087 3484.423 3596.9140 14873.898   200
like image 512
jaydee Avatar asked Apr 17 '19 13:04

jaydee


2 Answers

You can use the CJ (Cross Join) function from data.table to make a filtering table from the list.

lookup <- do.call(CJ, param)
head(lookup)
#    a b  c
# 1: 1 2  5
# 2: 1 2  7
# 3: 1 2 10
# 4: 1 3  5
# 5: 1 3  7
# 6: 1 3 10

DT[
    lookup,
    on = names(lookup),
    nomatch = NULL
]
#    a b c d
# 1: 1 3 5 1
# 2: 3 3 7 3

Note that nomatch = 0 means any combo in lookup that doesn't exist in DT won't return a row.

like image 157
Nathan Werth Avatar answered Jan 03 '23 19:01

Nathan Werth


Using Map we can do

DT[DT[, all(Map(`%in%`, .SD, param)), by = 1:nrow(DT)]$V1]
#   a b c d
#1: 1 3 5 1
#2: 3 3 7 3

For each row we check if all elements in DT are present in param.


Thanks to @Frank, this can be improved to

DT[DT[, all(mapply(`%in%`, .SD, param)), by = 1:nrow(DT), .SDcols=names(param)]$V1]
like image 45
markus Avatar answered Jan 03 '23 20:01

markus