In R, grep usually matches a vector of multiple strings against one regexp.
Q: Is it possible to match a single string against multiple regexps, without looping through each regexp pattern?
Some background:
I have 7000+ keywords that serve as indicators for several categories. I cannot change that keyword dictionary. The dictionary has the following structure (keywords in column 1; the numbers indicate the categories each keyword belongs to):

    ab 10 37 41
    abbrach* 38
    abbreche 39
    abbrich* 39
    abend* 37
    abendessen* 60 63
    aber 20 23 45
    abermals 37
Concatenating so many keywords with "|" is not a feasible way (and I wouldn't know which of the keywords generated the hit). Also, just reversing "patterns" and "strings" does not work, as the patterns have truncations, which wouldn't work the other way round.
We can also use grep and grepl to check for multiple character patterns in a vector of character strings. We simply insert the | operator between the patterns we want to search for; for example, the pattern "a|c" makes both functions search for "a" or "c".
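A minimal sketch of the | operator in use. The vector `x` is made up for illustration and is not part of the original post:

```r
# Hypothetical example vector (an assumption, not from the original post)
x <- c("apple", "berry", "cherry", "kiwi")

# "a|c" matches every element that contains "a" or "c"
idx  <- grep("a|c", x)    # indices of matching elements: 1 3
flag <- grepl("a|c", x)   # one logical per element: TRUE FALSE TRUE FALSE
```

Note that a combined pattern only reports that something matched, not which alternative produced the hit.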
With the command-line grep utility (as opposed to R's grep function), you can pass the -w flag to find exact matches for multiple patterns. Without -w, grep shows every line that contains one of the strings anywhere; with -w, it shows only lines where one of the strings occurs as a whole word.
The grep and grepl functions use regular expressions or literal values as patterns to conduct pattern matching on a character vector. grep returns the indices of the matched items, or the matched items themselves, while grepl returns a logical vector with TRUE representing a match and FALSE otherwise.
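The difference between the return values can be sketched as follows. The vectors are invented for illustration; `value = TRUE` and `fixed = TRUE` are standard arguments of base R's grep and grepl:

```r
# Hypothetical example vector (an assumption, not from the original post)
x <- c("dog", "cat", "bird")

pos  <- grep("d", x)                 # indices of matches: 1 3
vals <- grep("d", x, value = TRUE)   # the matched items: "dog" "bird"
lgl  <- grepl("d", x)                # logical vector: TRUE FALSE TRUE

# fixed = TRUE treats the pattern as a literal string, so "." is a real dot
# here rather than the regex wildcard for "any character"
lit <- grepl(".", c("a.b", "ab"), fixed = TRUE)   # TRUE FALSE
```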
What about applying the regexpr function over a vector of keywords?
    keywords <- c("dog", "cat", "bird")
    strings <- c("Do you have a dog?", "My cat ate by bird.", "Let's get icecream!")

    sapply(keywords, regexpr, strings, ignore.case=TRUE)
         dog cat bird
    [1,]  15  -1   -1
    [2,]  -1   4   15
    [3,]  -1  -1   -1

    sapply(keywords, regexpr, strings[1], ignore.case=TRUE)
     dog  cat bird
      15   -1   -1
Values returned are the position of the first character in the match, with -1 meaning no match.
If the position of the match is irrelevant, use grepl instead:
    sapply(keywords, grepl, strings, ignore.case=TRUE)
           dog   cat  bird
    [1,]  TRUE FALSE FALSE
    [2,] FALSE  TRUE  TRUE
    [3,] FALSE FALSE FALSE
Update: This runs relatively quickly on my system, even with a large number of keywords:
    # Available on most *nix systems
    words <- scan("/usr/share/dict/words", what="")
    length(words)
    [1] 234936

    system.time(matches <- sapply(words, grepl, strings, ignore.case=TRUE))
       user  system elapsed
      7.495   0.155   7.596

    dim(matches)
    [1]      3 234936
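For the truncated keywords in the question, the same grepl idea could be adapted roughly as follows. This is only a sketch under stated assumptions: a trailing * is taken to mean "any word starting with this stem", keywords without * are taken to require a whole-word match, and the stems are assumed to contain no other regex metacharacters. The helper `to_regex`, the toy dictionary lines, and the sample text are all hypothetical. It still iterates over the patterns internally via vapply, but it reports which keyword produced each hit.

```r
# Toy dictionary in the format shown in the question (keyword, then categories)
dict_lines <- c("ab 10 37 41", "abbrach* 38", "abend* 37",
                "abendessen* 60 63", "aber 20 23 45")

# Split each line into the keyword and its integer category codes
parts      <- strsplit(dict_lines, " +")
keywords   <- vapply(parts, `[`, character(1), 1)
categories <- lapply(parts, function(p) as.integer(p[-1]))

# Hypothetical helper: turn a dictionary keyword into a regex.
# Assumption: "stem*" matches any word starting with stem; a keyword without
# "*" must match a whole word. Stems are assumed free of regex metacharacters.
to_regex <- function(kw) {
  stem <- sub("\\*$", "", kw)
  if (grepl("\\*$", kw)) paste0("\\b", stem) else paste0("\\b", stem, "\\b")
}
patterns <- vapply(keywords, to_regex, character(1))

# Hypothetical single string to classify
text <- "Aber das Abendessen war gut"

# One logical per pattern: does this keyword occur in the text?
hits <- vapply(patterns, grepl, logical(1), x = text,
               ignore.case = TRUE, perl = TRUE)

matched <- keywords[hits]   # "abend*" "abendessen*" "aber"
cats    <- sort(unique(unlist(categories[hits], use.names = FALSE)))
# cats: 20 23 37 45 60 63
```

Because each keyword keeps its own pattern, the hits can be mapped straight back to the dictionary's category codes, which a single pattern joined with | would not allow.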