
Find two keywords if they are between 0 and 3 words apart

Tags: regex, r

I want to identify strings which feature two keywords that have between 0 and 3 words between them. What I have works in most cases:

strings <- c(
  "Today is my birthday",
  "Today is not yet my birthday",
  "Today birthday",
  "Today maybe?",
  "Today: birthday"
)


grepl("Today(\\s\\w+){0,3}\\sbirthday", strings, ignore.case = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE FALSE

Created on 2021-11-24 by the reprex package (v2.0.1)

My issue is with the string "Today: birthday". The problem is that a word is defined as (\\s\\w+), which leaves no room for punctuation in the sentence. How can I define a word in the regex so that punctuation does not break the match (ideally, it would simply be ignored)?

JBGruber asked Oct 14 '22 19:10



1 Answer

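You can use \W+ (one or more non-word characters) instead of \s as the separator between words, so that punctuation is skipped along with the whitespace: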

grepl("Today(\\W+\\w+){0,3}\\W+birthday", strings, ignore.case = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE  TRUE

Also, consider using word boundaries, a non-capturing group, and the more stable PCRE regex engine (enabled with perl = TRUE):

grepl("\\bToday(?:\\W+\\w+){0,3}\\W+birthday\\b", strings, ignore.case = TRUE, perl = TRUE)

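In (?:\W+\w+){0,3}\W+, the non-capturing group (?:\W+\w+) matches one or more non-word chars (\W+) followed by one or more word chars (\w+), i.e. a separator plus one word, and {0,3} repeats that group zero to three times. The final \W+ then matches the one or more non-word chars that separate the last of those words (or Today itself, when there are none) from birthday.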
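As a rough sketch, the same pattern can be wrapped in a small helper so the keywords and the maximum gap are parameters. The name keywords_within and its arguments are hypothetical, not part of base R, and the sketch assumes both keywords are plain words without regex metacharacters, since they are pasted into the pattern unescaped.

```r
# Hypothetical helper (not part of base R): TRUE where `first` is followed
# by `second` with at most `max_gap` words in between, ignoring punctuation.
# Assumes `first` and `second` contain no regex metacharacters.
keywords_within <- function(x, first, second, max_gap = 3) {
  pattern <- sprintf(
    "\\b%s(?:\\W+\\w+){0,%d}\\W+%s\\b",
    first, as.integer(max_gap), second
  )
  grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
}

strings <- c(
  "Today is my birthday",
  "Today is not yet my birthday",
  "Today birthday",
  "Today maybe?",
  "Today: birthday"
)

keywords_within(strings, "Today", "birthday")
#> [1]  TRUE FALSE  TRUE FALSE  TRUE

# The trailing word boundary rejects partial-word hits such as "birthdays"
keywords_within("Today, my birthdays", "Today", "birthday")
#> [1] FALSE
```

Without the \\b anchors, the second call would return TRUE, because birthday would match as a prefix of birthdays.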

Wiktor Stribiżew answered Oct 19 '22 10:10
