analog of setdiff() using regular expressions

Suppose I want to exclude values matching a series of regular expressions from a character vector, in the same way that I would use setdiff() for fixed character strings, e.g.

value <- c("apple pie", "cat", "dog", "dogmatic", "no apples")
re_setdiff(value, c("^apple", "^dog"))
## desired results:
value[c(2,5)]
[1] "cat"       "no apples"

I know how I can code this by brute force (see my answer) but am wondering if there's a more efficient/more idiomatic way to do it (maybe something in stringi/stringr?), or something that's already in a (widely used) package?

asked Oct 22 '25 05:10 by Ben Bolker


2 Answers

Here's the brute-force solution:

re_setdiff <- function(x, y, ...) {
   for (yy in y) {
      x <- grep(yy, x, invert = TRUE, value = TRUE, ...)
   }
   return(x)
}

I've left ... there in case someone wants to specify e.g. perl=TRUE. (I guess this could also be done more compactly/inscrutably with Reduce() ... ?)
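As a quick illustration of why forwarding ... can be useful, the sketch below repeats the function above and passes perl = TRUE so that a pattern can use a negative lookahead. The pattern "^dog(?!matic)" is a made-up example for this illustration, not something from the question.

```r
# Same brute-force function as above, repeated so the example is self-contained
re_setdiff <- function(x, y, ...) {
  for (yy in y) {
    x <- grep(yy, x, invert = TRUE, value = TRUE, ...)
  }
  x
}

value <- c("apple pie", "cat", "dog", "dogmatic", "no apples")

# Default regex engine, as in the question
res_default <- re_setdiff(value, c("^apple", "^dog"))

# perl = TRUE is forwarded to grep(), which enables lookaheads:
# "^dog(?!matic)" (hypothetical pattern) drops "dog" but keeps "dogmatic"
res_perl <- re_setdiff(value, c("^apple", "^dog(?!matic)"), perl = TRUE)

print(res_default)
print(res_perl)
```

Without perl = TRUE the lookahead syntax is not supported by the default engine, so the extra argument has to reach grep() for the second call to work.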

answered Oct 23 '25 18:10 by Ben Bolker


You are right that it can be done with Reduce():

> value <- c("apple pie", "cat", "dog", "dogmatic", "no apples")

> exclude <- c("^apple", "^dog")

> Reduce(\(x, y) grep(y, x, value = TRUE, invert = TRUE), exclude, value)
[1] "cat"       "no apples"
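To see what the Reduce() call above does at each step, one option is to ask it for the intermediate results with accumulate = TRUE. This is only a sketch for inspection; drop_matches is a hypothetical helper name introduced here.

```r
value <- c("apple pie", "cat", "dog", "dogmatic", "no apples")
exclude <- c("^apple", "^dog")

# Hypothetical helper: drop the elements of x that match a single pattern
drop_matches <- function(x, pattern) {
  grep(pattern, x, value = TRUE, invert = TRUE)
}

# accumulate = TRUE returns a list holding the initial vector followed by
# the result after each pattern has been applied in turn
steps <- Reduce(drop_matches, exclude, value, accumulate = TRUE)

print(steps)
```

The last element of steps is the same vector that the plain Reduce() call returns; the earlier elements show the vector shrinking one pattern at a time.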

Benchmarking

library(stringr)        # str_subset(), str_c()
library(microbenchmark) # microbenchmark()

jofrhwld <- \(value, exclude) {
    str_subset(
        value,
        # concat into 1 regex
        pattern = str_c(exclude, collapse = "|"),
        negate = TRUE
    )
}

tic <- \(value, exclude) {
    Reduce(\(x, y) grep(y, x, value = TRUE, invert = TRUE), exclude, value)
}

darrentsai <- \(value, exclude) {
    value[!rowSums(sapply(exclude, grepl, value))]
}

benbolker <- function(x, y, ...) {
    for (yy in y) {
        x <- grep(yy, x, invert = TRUE, value = TRUE, ...)
    }
    return(x)
}

microbenchmark(
    jofrhwld = jofrhwld(value, exclude),
    tic = tic(value, exclude),
    darrentsai = darrentsai(value, exclude),
    benbolker = benbolker(value, exclude),
    unit = "relative",
    check = "equal"
)

shows

Unit: relative
       expr      min       lq     mean   median       uq      max neval
   jofrhwld 3.298173 3.365282 4.302031 3.267003 4.027422 11.35689   100
        tic 1.326923 1.334713 3.154308 1.366346 1.429391 14.58037   100
 darrentsai 2.548269 2.761838 3.840466 2.606786 2.723790 17.21892   100
  benbolker 1.000000 1.000000 1.000000 1.000000 1.000000  1.00000   100
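
The jofrhwld variant collapses all patterns into a single regex using stringr. A base-R sketch of the same idea is shown below, assuming that none of the patterns changes meaning when joined with | (for example, patterns relying on numbered backreferences might). The function name re_setdiff_combined is hypothetical, and this variant was not part of the benchmark above, so no timing is claimed for it.

```r
# Hypothetical base-R counterpart of the single-regex approach
re_setdiff_combined <- function(x, y, ...) {
  # With no patterns, paste() would give "", which matches everything,
  # so return x unchanged in that case
  if (length(y) == 0) return(x)
  # Join the patterns into one alternation, e.g. "^apple|^dog"
  combined <- paste(y, collapse = "|")
  grep(combined, x, invert = TRUE, value = TRUE, ...)
}

value <- c("apple pie", "cat", "dog", "dogmatic", "no apples")
exclude <- c("^apple", "^dog")

res_combined <- re_setdiff_combined(value, exclude)
res_empty <- re_setdiff_combined(value, character(0))

print(res_combined)
```

This makes a single pass over the vector instead of one pass per pattern; whether that is faster in practice would depend on the number and complexity of the patterns.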

answered Oct 23 '25 17:10 by ThomasIsCoding


