Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: fast (conditional) subsetting where feasible

I would like to subset rows of my data

library(data.table); set.seed(333); n <- 100
dat <- data.table(id=1:n, x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))

> head(dat)
   id        x        y        z
1:  1 109.3400 208.6732 308.7595
2:  2 101.6920 201.0989 310.1080
3:  3 119.4697 217.8550 313.9384
4:  4 111.4261 205.2945 317.3651
5:  5 100.4024 212.2826 305.1375
6:  6 114.4711 203.6988 319.4913

in several stages. I am aware that I could apply subset(.) sequentially to achieve this.

> s <- subset(dat, x>119)
> s <- subset(s, y>219)
> subset(s, z>315)
   id        x        y        z
1: 55 119.2634 219.0044 315.6556

My problem is that I need to automate this and it might happen that the subset is empty. In this case, I would want to skip the step(s) that result in an empty set. For example, if my data was

dat2 <- dat[1:50]
> s <-subset(dat2,x>119)
> s
   id        x        y        z
1:  3 119.4697 217.8550 313.9384
2: 50 119.2519 214.2517 318.8567

the second step subset(s, y>219) would come up empty but I would still want to apply the third step subset(s,z>315). Is there a way to apply a subset-command only if it results in a non-empty set? I imagine something like subset(s, y>219, nonzero=TRUE). I would want to avoid constructions like

s <- dat
if(nrow(subset(s, x>119))>0){s <- subset(s, x>119)}
if(nrow(subset(s, y>219))>0){s <- subset(s, y>219)}
if(nrow(subset(s, z>318))>0){s <- subset(s, z>319)}

because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.). That's why I am hoping to find a solution optimized for speed.

PS. I only chose subset(.) for clarity, solutions with e.g. data.table would be just as welcome if not more so.

like image 339
bumblebee Avatar asked Aug 09 '19 12:08

bumblebee


2 Answers

I agree with Konrad's answer that this should throw a warning or at least report what happens somehow. Here's a data.table way that will take advantage of indices (see package vignettes for details):

f = function(x, ..., verbose=FALSE){
  L   = substitute(list(...))[-1]
  mon = data.table(cond = as.character(L))[, skip := FALSE]

  for (i in seq_along(L)){
    d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d
    } else {
      mon[i, skip := TRUE]
    }    
  }
  print(mon)
  return(x)
}

Usage

> f(dat, x > 119, y > 219, y > 1e6)
        cond  skip
1:   x > 119 FALSE
2:   y > 219 FALSE
3: y > 1e+06  TRUE
   id        x        y        z
1: 55 119.2634 219.0044 315.6556

The verbose option will print extra info provided by data.table package, so you can see when indices are being used. For example, with f(dat, x == 119, verbose=TRUE), I see it.

because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.).

If it's for non-interactive use, maybe better to have the function return list(mon = mon, x = x) to more easily keep track of what the query was and what happened. Also, the verbose console output could be captured and returned.

like image 159
Frank Avatar answered Oct 29 '22 20:10

Frank


An interesting approach could be developed using modified filter function offered in dplyr. In case of conditions not being met the non_empty_filter filter function returns original data set.

Notes

  • IMHO, this is fairly non-standard behaviour and should be reported via warning. Of course, this can be removed and has no bearing on the function results.

Function

library(tidyverse)
library(rlang) # enquo
non_empty_filter <- function(df, expr) {
    expr <- enquo(expr)

    res <- df %>% filter(!!expr)

    if (nrow(res) > 0) {
        return(res)
    } else {
        # Indicate that filter is not applied
        warning("No rows meeting conditon")
        return(df)
    }
}

Condition met

Behaviour: Returning one row for which the condition is met.

dat %>%
    non_empty_filter(x > 119 & y > 219)

Results

# id        x        y        z
# 1 55 119.2634 219.0044 315.6556

Condition not met

Behaviour: Returning the full data set as the whole condition is not met due to y > 1e6.

dat %>%
    non_empty_filter(x > 119 & y > 219 & y > 1e6)

Results

# id        x        y        z
# 1:   1 109.3400 208.6732 308.7595
# 2:   2 101.6920 201.0989 310.1080
# 3:   3 119.4697 217.8550 313.9384
# 4:   4 111.4261 205.2945 317.3651
# 5:   5 100.4024 212.2826 305.1375
# 6:   6 114.4711 203.6988 319.4913
# 7:   7 112.1879 209.5716 319.6732
# 8:   8 106.1344 202.2453 312.9427
# 9:   9 101.2702 210.5923 309.2864
# 10:  10 106.1071 211.8266 301.0645

Condition met/not met one-by-one

Behaviour: Skipping filter that would return an empty data set.

dat %>%
    non_empty_filter(y > 1e6) %>% 
    non_empty_filter(x > 119) %>% 
    non_empty_filter(y > 219)

Results

# id        x        y        z
# 1 55 119.2634 219.0044 315.6556
like image 22
Konrad Avatar answered Oct 29 '22 22:10

Konrad