Find two keywords if they are between 0 and 3 words apart

Tags: regex, r

I want to identify strings in which two keywords appear with between 0 and 3 words between them. What I have works in most cases:

strings <- c(
  "Today is my birthday",
  "Today is not yet my birthday",
  "Today birthday",
  "Today maybe?",
  "Today: birthday"
)


grepl("Today(\\s\\w+){0,3}\\sbirthday", strings, ignore.case = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE FALSE

Created on 2021-11-24 by the reprex package (v2.0.1)

My issue is with the string "Today: birthday". The problem is that each intervening word is matched by (\\s\\w+), which only allows whitespace before a word and so leaves no room for punctuation. How can I better define a word in the regex so that punctuation is not excluded (ideally, it would simply be ignored)?

asked Oct 14 '22 19:10

JBGruber

1 Answer

You can use

grepl("Today(\\W+\\w+){0,3}\\W+birthday", strings, ignore.case = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE  TRUE

Also, consider adding word boundaries, using a non-capturing group, and switching to the more stable PCRE regex engine (perl = TRUE):

grepl("\\bToday(?:\\W+\\w+){0,3}\\W+birthday\\b", strings, ignore.case = TRUE, perl=TRUE)
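To see what the word boundaries add, here is a small check. The input strings are made up for illustration and are not from the question: one has a longer word starting with the second keyword, and one has the first keyword embedded in a longer token.

```r
# Illustrative inputs (assumptions, not taken from the question)
extra <- c(
  "Today is my birthday",   # plain match
  "Today is my birthdays",  # second keyword is only a prefix of a longer word
  "xToday birthday"         # first keyword is embedded in a longer token
)

no_boundaries   <- "Today(?:\\W+\\w+){0,3}\\W+birthday"
with_boundaries <- "\\bToday(?:\\W+\\w+){0,3}\\W+birthday\\b"

# Without \\b the keywords may match inside longer words
res_no <- grepl(no_boundaries, extra, ignore.case = TRUE, perl = TRUE)

# With \\b both keywords must be whole words
res_with <- grepl(with_boundaries, extra, ignore.case = TRUE, perl = TRUE)

print(res_no)
print(res_with)
```

Only the first string should pass the version with boundaries, while all three pass the version without them.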

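The (?:\W+\w+){0,3}\W+ part matches zero to three repetitions of a group made of one or more non-word chars (\W+) followed by one or more word chars (\w+), and then one or more non-word chars before the second keyword. Since \W matches any character that is not a letter, digit, or underscore, whitespace and punctuation are both accepted between words.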
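If the keywords or the allowed gap vary, the same pattern can be wrapped in a small function. This is only a sketch: the name keywords_within and its arguments are hypothetical, and it assumes both keywords are plain words without regex metacharacters.

```r
# Hypothetical helper: TRUE where `first` is followed by `second` with at most
# `max_gap` words in between; punctuation and whitespace between words are
# absorbed by \\W+. Assumes the keywords contain no regex metacharacters.
keywords_within <- function(x, first, second, max_gap = 3) {
  pattern <- sprintf(
    "\\b%s(?:\\W+\\w+){0,%d}\\W+%s\\b",
    first, max_gap, second
  )
  grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
}

strings <- c(
  "Today is my birthday",
  "Today is not yet my birthday",
  "Today birthday",
  "Today maybe?",
  "Today: birthday"
)

res_default <- keywords_within(strings, "Today", "birthday")
print(res_default)
#> [1]  TRUE FALSE  TRUE FALSE  TRUE

# A tighter gap rejects the first string, which has two words in between
res_tight <- keywords_within("Today is my birthday", "Today", "birthday", max_gap = 1)
print(res_tight)
#> [1] FALSE
```

Keywords that may contain characters such as "." or "(" would need to be escaped before being placed into the pattern.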

answered Oct 19 '22 10:10

Wiktor Stribiżew
