Efficient filtering through multiple columns by group

Tags:

Assume a dataset containing multiple rows per ID and multiple columns containing some codes stored as strings:

df <- data.frame(id = rep(1:3, each = 2),
                 var1 = c("X1", "Y1", "Y2", "Y3", "Z1", "Z2"),
                 var2 = c("Y1", "X2", "Y2", "Y3", "Z1", "Z2"),
                 var3 = c("Y1", "Y2", "X1", "Y3", "Z1", "Z2"),
                 stringsAsFactors = FALSE)

  id var1 var2 var3
1  1   X1   Y1   Y1
2  1   Y1   X2   Y2
3  2   Y2   Y2   X1
4  2   Y3   Y3   Y3
5  3   Z1   Z1   Z1
6  3   Z2   Z2   Z2

Now, assume that I want to filter out all IDs that have a specific code (here X) in any of the relevant columns. With dplyr and purrr, I could do:

df %>%
 group_by(id) %>%
 filter(all(reduce(.x = across(var1:var3, ~ !grepl("^X", .)), .f = `&`)))

     id var1  var2  var3 
  <int> <chr> <chr> <chr>
1     3 Z1    Z1    Z1   
2     3 Z2    Z2    Z2

It works fine, it's compact and it's easy to understand, however, it's fairly inefficient with big datasets (millions of IDs and tens of millions of observations). I would welcome any ideas for computationally more efficient code, using any library.

678

asked Jun 10 '21 13:06

tmfmnk

1 Answers

Some Possible Points for Speed

Try NOT using something like group by, i.e., group_by in dplyr or by = in data.table, since that will slow down your overall performance
If you have fixed objective pattern, e.g., starting with X, then substr might be more efficient than grepl with pattern ^X

Some Base R Approaches

It seems we can speed up further based on the @Waldi's fastest approach through the following one

TIC1 <- function() {
    subset(df, ave(rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") == 0, id, FUN = all))
}

TIC2 <- function() {
    subset(df, !id %in% id[rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") > 0])
}

TIC3 <- function() {
    subset(df, !id %in% id[do.call(pmax, lapply(df[-1], function(v) substr(v, 1, 1) == "X")) > 0])
}

Benchmarking

compared to answers from @Waldi and @EnricoSchumann:

microbenchmark(
    TIC1(),
    TIC2(),
    TIC3(),
    fun1(),
    fun2(),
    waldi_speed(),
    unit = "relative"
)

Unit: relative
          expr       min        lq      mean    median        uq       max
        TIC1()  3.385215  3.451424  3.488670  3.569668  3.684895  3.618991
        TIC2()  1.062116  1.084568  1.074789  1.090400  1.114443  1.027673
        TIC3()  1.077660  2.208734  2.185960  2.214180  2.293366  2.141994
        fun1()  1.166342  1.155096  1.169574  1.153223  1.207932  1.405530
        fun2()  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000
 waldi_speed() 26.218953 26.560429 26.373054 26.952997 27.396017 26.333575
 neval
   100
   100
   100
   100
   100
   100

given

n <- 5e4
df <- data.frame(
    id = rep(1:(n / 2), each = 2, length.out = n),
    var1 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    var2 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    var3 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    stringsAsFactors = FALSE
)

TIC1 <- function() {
    subset(df, ave(rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") == 0, id, FUN = all))
}

TIC2 <- function() {
    subset(df, !id %in% id[rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") > 0])
}

TIC3 <- function() {
    subset(df, !id %in% id[do.call(pmax, lapply(df[-1], function(v) substr(v, 1, 1) == "X")) > 0])
}


waldi_speed <- function() {
    setDT(df)
    df[df[, .(keep = .I[!any(grepl("X", .SD))]), by = id, .SDcols = patterns("var")]$keep]
}


repeated_or <- function(...) {
    L <- list(...)
    ans <- L[[1L]]
    if (...length() > 1L) {
          for (i in seq.int(2L, ...length())) {
                ans <- ans | L[[i]]
            }
      }
    ans
}

fun1 <- function() {
    ## using a pattern
    m <- lapply(df[, -1], grepl, pattern = "^X", perl = TRUE)
    df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}

fun2 <- function() {
    ## using a fixed string
    m <- lapply(df[, -1], function(x) substr(x, 1, 1) == "X")
    df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}

190

answered Sep 18 '22 14:09

ThomasIsCoding

Related questions
                            
                                Avoiding multiple for-loops in R to calculate a matrix
                            
                                Scatterplot: Error in FUN(X[[i]], ...) : object 'Group' not found
                            
                                How to use map from purrr with dplyr::mutate to create multiple new columns based on column pairs
                            
                                Converting between matrix subscripts and linear indices (like ind2sub/sub2ind in matlab)
                            
                                Challenge: recoding a data.frame() — make it faster
                            
                                Replace NaN values in a list with zero (0)
                            
                                Calculating length of 95%-CI using dplyr
                            
                                Color schemes in R?
                            
                                How to plot a subset of a data frame in R?
                            
                                Closing active connections using RMySQL
                            
                                Function to split a matrix into sub-matrices in R
                            
                                Faster ways to calculate frequencies and cast from long to wide
                            
                                Assignment in R language
                            
                                How to change the background color of the Shiny Dashboard Body
                            
                                add column values based on other columns in data frame using for and if
                            
                                Cannot build R package "png" Fedora 20
                            
                                Remove all text between two brackets
                            
                                sf: Write Lat/Long from geometry into separate column and keep ID column
                            
                                RStudio Shiny renderDataTable font size
                            
                                Is it possible to skip NA values in "+" operator?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Efficient filtering through multiple columns by group

Tags:

performance

regex

dataframe

r

filtering