Assume a dataset containing multiple rows per ID and multiple columns containing some codes stored as strings:
df <- data.frame(id = rep(1:3, each = 2),
var1 = c("X1", "Y1", "Y2", "Y3", "Z1", "Z2"),
var2 = c("Y1", "X2", "Y2", "Y3", "Z1", "Z2"),
var3 = c("Y1", "Y2", "X1", "Y3", "Z1", "Z2"),
stringsAsFactors = FALSE)
id var1 var2 var3
1 1 X1 Y1 Y1
2 1 Y1 X2 Y2
3 2 Y2 Y2 X1
4 2 Y3 Y3 Y3
5 3 Z1 Z1 Z1
6 3 Z2 Z2 Z2
Now, assume that I want to filter out all IDs that have a specific code (here X
) in any of the relevant columns. With dplyr
and purrr
, I could do:
df %>%
group_by(id) %>%
filter(all(reduce(.x = across(var1:var3, ~ !grepl("^X", .)), .f = `&`)))
id var1 var2 var3
<int> <chr> <chr> <chr>
1 3 Z1 Z1 Z1
2 3 Z2 Z2 Z2
It works fine, it's compact and it's easy to understand, however, it's fairly inefficient with big datasets (millions of IDs and tens of millions of observations). I would welcome any ideas for computationally more efficient code, using any library.
Select the columns of the range or table that have filters applied, and then on the Data tab, click Filter. Select the columns of the range or table that have filters applied, and then on the Data tab, click Filter.
group_by
in dplyr
or by =
in data.table
, since that will slow down your overall performanceX
, then substr
might be more efficient than grepl
with pattern ^X
It seems we can speed up further based on the @Waldi's fastest approach through the following one
TIC1 <- function() {
subset(df, ave(rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") == 0, id, FUN = all))
}
or
TIC2 <- function() {
subset(df, !id %in% id[rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") > 0])
}
or
TIC3 <- function() {
subset(df, !id %in% id[do.call(pmax, lapply(df[-1], function(v) substr(v, 1, 1) == "X")) > 0])
}
compared to answers from @Waldi and @EnricoSchumann:
microbenchmark(
TIC1(),
TIC2(),
TIC3(),
fun1(),
fun2(),
waldi_speed(),
unit = "relative"
)
Unit: relative
expr min lq mean median uq max
TIC1() 3.385215 3.451424 3.488670 3.569668 3.684895 3.618991
TIC2() 1.062116 1.084568 1.074789 1.090400 1.114443 1.027673
TIC3() 1.077660 2.208734 2.185960 2.214180 2.293366 2.141994
fun1() 1.166342 1.155096 1.169574 1.153223 1.207932 1.405530
fun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
waldi_speed() 26.218953 26.560429 26.373054 26.952997 27.396017 26.333575
neval
100
100
100
100
100
100
given
n <- 5e4
df <- data.frame(
id = rep(1:(n / 2), each = 2, length.out = n),
var1 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
var2 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
var3 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
stringsAsFactors = FALSE
)
TIC1 <- function() {
subset(df, ave(rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") == 0, id, FUN = all))
}
TIC2 <- function() {
subset(df, !id %in% id[rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") > 0])
}
TIC3 <- function() {
subset(df, !id %in% id[do.call(pmax, lapply(df[-1], function(v) substr(v, 1, 1) == "X")) > 0])
}
waldi_speed <- function() {
setDT(df)
df[df[, .(keep = .I[!any(grepl("X", .SD))]), by = id, .SDcols = patterns("var")]$keep]
}
repeated_or <- function(...) {
L <- list(...)
ans <- L[[1L]]
if (...length() > 1L) {
for (i in seq.int(2L, ...length())) {
ans <- ans | L[[i]]
}
}
ans
}
fun1 <- function() {
## using a pattern
m <- lapply(df[, -1], grepl, pattern = "^X", perl = TRUE)
df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}
fun2 <- function() {
## using a fixed string
m <- lapply(df[, -1], function(x) substr(x, 1, 1) == "X")
df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With