Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient filtering through multiple columns by group

Assume a dataset containing multiple rows per ID and multiple columns containing some codes stored as strings:

df <- data.frame(id = rep(1:3, each = 2),
                 var1 = c("X1", "Y1", "Y2", "Y3", "Z1", "Z2"),
                 var2 = c("Y1", "X2", "Y2", "Y3", "Z1", "Z2"),
                 var3 = c("Y1", "Y2", "X1", "Y3", "Z1", "Z2"),
                 stringsAsFactors = FALSE)

  id var1 var2 var3
1  1   X1   Y1   Y1
2  1   Y1   X2   Y2
3  2   Y2   Y2   X1
4  2   Y3   Y3   Y3
5  3   Z1   Z1   Z1
6  3   Z2   Z2   Z2

Now, assume that I want to filter out all IDs that have a specific code (here X) in any of the relevant columns. With dplyr and purrr, I could do:

df %>%
 group_by(id) %>%
 filter(all(reduce(.x = across(var1:var3, ~ !grepl("^X", .)), .f = `&`)))

     id var1  var2  var3 
  <int> <chr> <chr> <chr>
1     3 Z1    Z1    Z1   
2     3 Z2    Z2    Z2 

It works fine, it's compact and it's easy to understand, however, it's fairly inefficient with big datasets (millions of IDs and tens of millions of observations). I would welcome any ideas for computationally more efficient code, using any library.

like image 678
tmfmnk Avatar asked Jun 10 '21 13:06

tmfmnk


People also ask

How do I filter all columns in Excel?

Select the columns of the range or table that have filters applied, and then on the Data tab, click Filter. Select the columns of the range or table that have filters applied, and then on the Data tab, click Filter.


1 Answers

Some Possible Points for Speed

  • Try NOT using something like group by, i.e., group_by in dplyr or by = in data.table, since that will slow down your overall performance
  • If you have fixed objective pattern, e.g., starting with X, then substr might be more efficient than grepl with pattern ^X

Some Base R Approaches

It seems we can speed up further based on the @Waldi's fastest approach through the following one

TIC1 <- function() {
    subset(df, ave(rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") == 0, id, FUN = all))
}

or

TIC2 <- function() {
    subset(df, !id %in% id[rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") > 0])
}

or

TIC3 <- function() {
    subset(df, !id %in% id[do.call(pmax, lapply(df[-1], function(v) substr(v, 1, 1) == "X")) > 0])
}

Benchmarking

compared to answers from @Waldi and @EnricoSchumann:

microbenchmark(
    TIC1(),
    TIC2(),
    TIC3(),
    fun1(),
    fun2(),
    waldi_speed(),
    unit = "relative"
)

Unit: relative
          expr       min        lq      mean    median        uq       max
        TIC1()  3.385215  3.451424  3.488670  3.569668  3.684895  3.618991
        TIC2()  1.062116  1.084568  1.074789  1.090400  1.114443  1.027673
        TIC3()  1.077660  2.208734  2.185960  2.214180  2.293366  2.141994
        fun1()  1.166342  1.155096  1.169574  1.153223  1.207932  1.405530
        fun2()  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000
 waldi_speed() 26.218953 26.560429 26.373054 26.952997 27.396017 26.333575
 neval
   100
   100
   100
   100
   100
   100

given

n <- 5e4
df <- data.frame(
    id = rep(1:(n / 2), each = 2, length.out = n),
    var1 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    var2 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    var3 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    stringsAsFactors = FALSE
)

TIC1 <- function() {
    subset(df, ave(rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") == 0, id, FUN = all))
}

TIC2 <- function() {
    subset(df, !id %in% id[rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") > 0])
}

TIC3 <- function() {
    subset(df, !id %in% id[do.call(pmax, lapply(df[-1], function(v) substr(v, 1, 1) == "X")) > 0])
}


waldi_speed <- function() {
    setDT(df)
    df[df[, .(keep = .I[!any(grepl("X", .SD))]), by = id, .SDcols = patterns("var")]$keep]
}


repeated_or <- function(...) {
    L <- list(...)
    ans <- L[[1L]]
    if (...length() > 1L) {
          for (i in seq.int(2L, ...length())) {
                ans <- ans | L[[i]]
            }
      }
    ans
}

fun1 <- function() {
    ## using a pattern
    m <- lapply(df[, -1], grepl, pattern = "^X", perl = TRUE)
    df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}

fun2 <- function() {
    ## using a fixed string
    m <- lapply(df[, -1], function(x) substr(x, 1, 1) == "X")
    df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}
like image 190
ThomasIsCoding Avatar answered Sep 18 '22 14:09

ThomasIsCoding