I'm looking for a regular expression to grep whole words, including words separated by digits or underscores. \\b considers digits and underscores as parts of words, not as boundaries.
For example, I'd like to catch MOUSE in "DOG MOUSE CAT" and in "DOG MOUSE:CAT", but also in "DOG_MOUSE9CAT" and at the beginning or end of an expression, as in "MOUSE9CAT" and "DOG_MOUSE". Basically, the boundary I'm looking for is any non-uppercase-alpha character, plus the beginning and end of the line/expression (I may be missing some other cases caught by \\b here).
I've tried:
"[[0-9_]\\b]MOUSE[[0-9_]\\b]"
"[[0-9_]|\\b]MOUSE[[0-9_]|\\b]"
"[$|[^A-Z]]MOUSE[^|[^A-Z]]"
"[?<=^|[^A-Z]]MOUSE[?=$|[^A-Z]]"
None of them work.
I'm actually looking for several words (based on a long vector of values), so the final result should look something like
grep(paste0("\\b", paste(searchwords, collapse = "\\b|\\b"), "\\b"), targettext)
(with a different delimiter because \\b is too restrictive for me).
(This is a similar question to the one asked by user Nick Sabbe in a comment here: Using grep in R to find strings as whole words (but not strings as part of words))
Word Boundary: \b
The word boundary \b matches positions where one side is a word character (usually a letter, digit or underscore, though the exact set varies across engines) and the other side is not a word character (for instance, it may be the beginning of the string or a space character).
The regex \B-\B matches the - in "color - coded", where the hyphen has a space on each side. \b-\b, on the other hand, matches the - in "nine-digit" and "pass-key".
Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b. A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.
A word boundary, in most regex dialects, is a position between \w (a word character, [0-9A-Za-z_]) and \W (a non-word character), or at the beginning or end of a string if it begins or ends (respectively) with a word character. So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.
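As a small check of this definition in R, the sketch below runs \\b against the strings from the question. The variable name hits_b is only illustrative, and perl = TRUE is an assumption made here so that \w has its usual ASCII meaning.

```r
# Strings taken from the question
targettext <- c("DOG MOUSE CAT", "DOG MOUSE:CAT", "DOG_MOUSE9CAT",
                "MOUSE9CAT", "DOG_MOUSE")

# \\b needs a word/non-word transition on both sides of MOUSE.
# "_" and "9" are word characters, so they do not create a boundary.
hits_b <- grepl("\\bMOUSE\\b", targettext, perl = TRUE)
print(hits_b)
# [1]  TRUE  TRUE FALSE FALSE FALSE
```

Only the strings where MOUSE is delimited by a space, a colon, or the string edge on both sides are matched, which is exactly the limitation the question describes.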
A regular expression is a search pattern that the grep command matches against a specified file or against provided text. To let the user express the pattern more flexibly, grep assigns special meanings to a few characters. These characters are known as metacharacters.
grep extended regex (search multiple words)
The pipe sign (|) is used to search for multiple words with the grep command. Join the words with pipe signs and surround the whole pattern with quotes. For example, to search for the words abc, fgh, xyz, mno and jkl, use the search pattern "abc|fgh|xyz|mno|jkl".
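The same alternation idea carries over to R, whose default regex flavour is extended regular expressions. A minimal sketch, with made-up input strings and illustrative variable names:

```r
# Words from the example above, joined with "|" into one pattern
words <- c("abc", "fgh", "xyz", "mno", "jkl")
pattern <- paste(words, collapse = "|")

# Hypothetical input lines, for illustration only
input <- c("one abc two", "nothing here", "xyzzy")

# TRUE where at least one of the words occurs anywhere in the string
hits_alt <- grepl(pattern, input)
print(hits_alt)
# [1]  TRUE FALSE  TRUE
```

Note that plain alternation matches substrings ("xyz" inside "xyzzy"), so it does not by itself give whole-word matching.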
Extended regular expressions use metacharacters that were added later. Because these characters were not defined in the original implementation, grep treats them as ordinary characters unless told otherwise. To make grep treat them as metacharacters, use the -E option.
Use PCRE regex with lookarounds:
grep("(?<![A-Z])MOUSE(?![A-Z])", targettext, perl=TRUE)
See the regex demo
The negative lookbehind (?<![A-Z]) fails the match if the word is preceded by an uppercase ASCII letter, and the negative lookahead (?![A-Z]) fails the match if the word is followed by an uppercase ASCII letter.
To apply the lookarounds to all the alternatives you have, use an outer grouping (?:...|...).
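To see why the grouping matters, here is a short comparison, assuming two search words MOUSE and FROG and a few made-up test strings. Without the group, the lookbehind attaches only to the first alternative and the lookahead only to the last.

```r
# Hypothetical test strings, for illustration only
x <- c("XFROG", "MOUSETRAP", "DOG_FROG9")

# Ungrouped: read as (?<![A-Z])MOUSE  OR  FROG(?![A-Z])
ungrouped <- "(?<![A-Z])MOUSE|FROG(?![A-Z])"
# Grouped: both lookarounds apply to every alternative
grouped <- "(?<![A-Z])(?:MOUSE|FROG)(?![A-Z])"

res_ungrouped <- grepl(ungrouped, x, perl = TRUE)
res_grouped <- grepl(grouped, x, perl = TRUE)
print(res_ungrouped)
# [1] TRUE TRUE TRUE
print(res_grouped)
# [1] FALSE FALSE  TRUE
```

The ungrouped pattern wrongly accepts "XFROG" and "MOUSETRAP", because each word is only checked on one side.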
See the R online demo:
> targettext <- c("DOG MOUSE CAT","DOG MOUSE:CAT","DOG_MOUSE9CAT","MOUSE9CAT","DOG_MOUSE")
> searchwords <- c("MOUSE","FROG")
> grep(paste0("(?<![A-Z])(?:", paste(searchwords, collapse = "|"), ")(?![A-Z])"), targettext, perl=TRUE)
[1] 1 2 3 4 5
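If the search words come from a long vector and might contain regex metacharacters, one option is to quote each word literally before building the pattern. The helper below is a hypothetical sketch (build_pattern is not part of the answer above); it relies on PCRE's \Q...\E quoting, so it assumes perl = TRUE and words that do not themselves contain \E.

```r
# Hypothetical helper: builds one pattern for all search words
build_pattern <- function(words) {
  # \\Q...\\E makes PCRE treat everything in between literally
  quoted <- paste0("\\Q", words, "\\E")
  paste0("(?<![A-Z])(?:", paste(quoted, collapse = "|"), ")(?![A-Z])")
}

# Made-up inputs; "A.B" contains a metacharacter on purpose
candidates <- c("DOG_MOUSE9CAT", "MOUSETRAP", "A.B", "AXB")
pattern <- build_pattern(c("MOUSE", "A.B"))

hits_quoted <- grepl(pattern, candidates, perl = TRUE)
print(hits_quoted)
# [1]  TRUE FALSE  TRUE FALSE
```

Because of the quoting, "A.B" matches only the literal string with a dot, not "AXB".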