R grep: Match one string against multiple patterns

Tags: regex, r

In R, grep usually matches a vector of multiple strings against one regexp.

Q: Is it possible to match a single string against multiple regexps without looping over each individual pattern?

Some background:

I have 7000+ keywords that serve as indicators for several categories. I cannot change that keyword dictionary. The dictionary has the following structure (keywords in column 1; the numbers indicate the categories each keyword belongs to):

    ab          10  37  41
    abbrach*    38
    abbreche    39
    abbrich*    39
    abend*      37
    abendessen* 60  63
    aber        20  23  45
    abermals    37

Concatenating so many keywords with "|" is not feasible (and I wouldn't know which of the keywords generated the hit). Simply swapping "patterns" and "strings" does not work either, because the patterns contain truncations (the trailing *), which would not work the other way round.

[related question, other programming language]

asked Mar 02 '12 17:03 by Felix S



1 Answer

What about applying the regexpr function over a vector of keywords?

    keywords <- c("dog", "cat", "bird")
    strings <- c("Do you have a dog?", "My cat ate by bird.", "Let's get icecream!")

    sapply(keywords, regexpr, strings, ignore.case=TRUE)
    #      dog cat bird
    # [1,]  15  -1   -1
    # [2,]  -1   4   15
    # [3,]  -1  -1   -1

    sapply(keywords, regexpr, strings[1], ignore.case=TRUE)
    #  dog  cat bird
    #   15   -1   -1

Values returned are the position of the first character in the match, with -1 meaning no match.

If the position of the match is irrelevant, use grepl instead:

    sapply(keywords, grepl, strings, ignore.case=TRUE)
    #        dog   cat  bird
    # [1,]  TRUE FALSE FALSE
    # [2,] FALSE  TRUE  TRUE
    # [3,] FALSE FALSE FALSE
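For the dictionary described in the question, the same idea might be sketched as follows. This is only an illustration under assumptions: the `dict` list, the `to_regex` helper, and the sample `text` are hypothetical, and a trailing `*` is assumed to mean "any word starting with this stem", while a keyword without `*` must match a whole word.

```r
# Hypothetical mini-dictionary in the format shown in the question:
# keyword -> category numbers. A trailing "*" marks a truncated keyword.
dict <- list(
  "ab"          = c(10, 37, 41),
  "abbrach*"    = 38,
  "abend*"      = 37,
  "abendessen*" = c(60, 63),
  "aber"        = c(20, 23, 45)
)

# Assumption: "abend*" matches any word starting with "abend";
# a keyword without "*" must match a whole word. \\b is a word boundary.
to_regex <- function(keyword) {
  stem <- sub("\\*$", "", keyword)
  if (grepl("\\*$", keyword)) {
    paste0("\\b", stem)
  } else {
    paste0("\\b", stem, "\\b")
  }
}

# Named character vector: names are the original keywords.
patterns <- vapply(names(dict), to_regex, character(1))

text <- "Aber das Abendessen war gut"

# One string against every pattern; the result is a named logical vector.
hits <- vapply(patterns, grepl, logical(1), x = text, ignore.case = TRUE)

# Which keywords generated a hit, and which categories they belong to.
matched_keywords <- names(hits)[hits]
matched_categories <- sort(unique(unlist(dict[matched_keywords], use.names = FALSE)))
```

Because the patterns keep the original keywords as names, `matched_keywords` shows which dictionary entries produced the hit, which a single pattern joined with "|" would not.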

Update: The sapply approach runs relatively quickly on my system, even with a large number of keywords:

    # Available on most *nix systems
    words <- scan("/usr/share/dict/words", what="")
    length(words)
    # [1] 234936

    system.time(matches <- sapply(words, grepl, strings, ignore.case=TRUE))
    #    user  system elapsed
    #   7.495   0.155   7.596

    dim(matches)
    # [1]      3 234936
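One caveat with a plain word list is that every entry is interpreted as a regular expression. If the keywords are meant literally, a variant along these lines may be worth trying. It is a sketch, not a benchmark: the sample data is made up, `vapply` is swapped in for `sapply` only to make the result type explicit, and since `fixed = TRUE` does not honor `ignore.case`, the text is lower-cased up front.

```r
# Made-up sample data; "a.b" is included to show the literal vs. regex difference.
keywords <- c("dog", "cat", "bird", "a.b")
strings <- c("Do you have a DOG?", "My cat ate my bird.", "axb")

# fixed = TRUE ignores ignore.case, so normalise case before matching.
lower <- tolower(strings)

# One column per keyword, one row per string.
matches <- vapply(keywords, grepl, logical(length(lower)),
                  x = lower, fixed = TRUE)

# As a regex, "a.b" matches "axb"; as a fixed string it does not.
regex_hit <- grepl("a.b", "axb")
fixed_hit <- matches[3, "a.b"]
```

Whether literal matching is appropriate depends on the dictionary; truncated keywords such as those in the question would still need to be handled as patterns.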
answered Sep 24 '22 00:09 by danpelota