Grep in R to find words with custom "extended" boundaries

Tags:

regex

r

I'm looking for a regular expression to grep whole words, including words delimited by digits or underscores. \\b treats digits and underscores as parts of words, not as boundaries.

For example, I'd like to catch MOUSE in "DOG MOUSE CAT" and in "DOG MOUSE:CAT", but also in "DOG_MOUSE9CAT" and at the beginning or end of an expression, as in "MOUSE9CAT" and "DOG_MOUSE". Basically, the boundary I'm looking for is any character that is not an uppercase letter, plus the beginning and end of the line/expression (I may be missing some other cases caught by \\b here).
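
To make the problem concrete, here is a small check of how \\b behaves on the five example strings. It is only an illustrative sketch: the variable names are made up for this example, and perl = TRUE is used so that the PCRE definition of a word character (letters, digits and underscore) applies.

```r
# Example strings from the question (variable names are illustrative)
examples <- c("DOG MOUSE CAT", "DOG MOUSE:CAT", "DOG_MOUSE9CAT",
              "MOUSE9CAT", "DOG_MOUSE")

# \b only fires between a word character [A-Za-z0-9_] and a non-word
# character (or the string edge), so "_" and "9" do not count as boundaries.
word_boundary_hits <- grepl("\\bMOUSE\\b", examples, perl = TRUE)
print(word_boundary_hits)
```

Only the first two strings match; the three where MOUSE touches an underscore or a digit are missed, which is exactly the limitation described above.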

I've tried:

"[[0-9_]\\b]MOUSE[[0-9_]\\b]"
"[[0-9_]|\\b]MOUSE[[0-9_]|\\b]"
"[$|[^A-Z]]MOUSE[^|[^A-Z]]"
"[?<=^|[^A-Z]]MOUSE[?=$|[^A-Z]]"

None of them work.

I'm actually looking for several words (based on a long vector of values), so the final result should look something like:

grep(paste0("\\b", paste(searchwords, collapse = "\\b|\\b"), "\\b"), targettext)

(with a different delimiter because \\b is too restrictive for me).

(This is similar to the question asked by user Nick Sabbe in a comment on "Using grep in R to find strings as whole words (but not strings as part of words)".)

asked Nov 25 '16 10:11

syre

1 Answer

Use PCRE regex with lookarounds:

grep("(?<![A-Z])MOUSE(?![A-Z])", targettext, perl=TRUE)

See the regex demo

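The (?<![A-Z]) negative lookbehind fails the match if the word is immediately preceded by an uppercase ASCII letter, and the negative lookahead (?![A-Z]) fails the match if the word is immediately followed by one. Because both lookarounds are negative, they also succeed at the start and end of the string, where there is no adjacent character at all.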
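
As a quick illustration of those two lookarounds, the sketch below runs the same pattern against a few strings that should and should not match. The test strings and variable names are invented for this example and are not part of the original demo.

```r
# Same pattern as above; the test strings are hypothetical
pattern <- "(?<![A-Z])MOUSE(?![A-Z])"
x <- c("MOUSE",          # string edges on both sides
       "DOG_MOUSE9CAT",  # underscore before, digit after
       "MOUSETRAP",      # followed by an uppercase letter
       "DORMOUSE",       # preceded by an uppercase letter
       "a MOUSE.")       # space before, punctuation after

lookaround_hits <- grepl(pattern, x, perl = TRUE)
print(lookaround_hits)
```

The third and fourth strings are rejected because an uppercase letter sits directly next to MOUSE; the others match.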

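To apply the lookarounds to every alternative, wrap the alternatives in a non-capturing group, (?:...|...). Without the group, the lookbehind would apply only to the first alternative and the lookahead only to the last.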

See the R online demo:

> targettext <- c("DOG MOUSE CAT","DOG MOUSE:CAT","DOG_MOUSE9CAT","MOUSE9CAT","DOG_MOUSE")
> searchwords <- c("MOUSE","FROG")
> grep(paste0("(?<![A-Z])(?:", paste(searchwords, collapse = "|"), ")(?![A-Z])"), targettext, perl=TRUE)
[1] 1 2 3 4 5
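
If you also need to know which words were found, not just which elements matched, one possible approach is to combine gregexpr() with regmatches(). This is a sketch under the assumption that the search words contain only plain uppercase letters (no regex metacharacters that would need escaping); the sample strings here are hypothetical.

```r
# Hypothetical sample data; searchwords are assumed to be plain A-Z words
targettext  <- c("DOG MOUSE CAT", "FROG9MOUSE", "BULLFROG", "DOG_CAT")
searchwords <- c("MOUSE", "FROG")

# Same construction as in the answer: one group, lookarounds on the outside
pattern <- paste0("(?<![A-Z])(?:", paste(searchwords, collapse = "|"), ")(?![A-Z])")

# Indices of the elements that contain at least one whole word
hit_index <- grep(pattern, targettext, perl = TRUE)

# All whole-word matches per element (character(0) where nothing matched)
found <- regmatches(targettext, gregexpr(pattern, targettext, perl = TRUE))

print(hit_index)
print(found)
```

Here "BULLFROG" yields no match because FROG is preceded by an uppercase letter, while "FROG9MOUSE" yields both words because the digit acts as a boundary.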

answered Sep 27 '22 16:09

Wiktor Stribiżew
