I want to identify strings that contain two keywords with between 0 and 3 words between them. What I have works in most cases:
strings <- c(
"Today is my birthday",
"Today is not yet my birthday",
"Today birthday",
"Today maybe?",
"Today: birthday"
)
grepl("Today(\\s\\w+){0,3}\\sbirthday", strings, ignore.case = TRUE)
#> [1] TRUE FALSE TRUE FALSE FALSE
Created on 2021-11-24 by the reprex package (v2.0.1)
My issue is with the string "Today: birthday". The problem is that a word is defined as (\\s\\w+), which leaves no room for the sentence to contain any punctuation. How can I better define the regex for a word so that punctuation is not excluded? (Ideally, it would simply be ignored.)
You can use
> grepl("Today(\\W+\\w+){0,3}\\W+birthday", strings, ignore.case = TRUE)
[1] TRUE FALSE TRUE FALSE TRUE
Also, consider using word boundaries, non-capturing groups, and the more stable PCRE regex engine:
grepl("\\bToday(?:\\W+\\w+){0,3}\\W+birthday\\b", strings, ignore.case = TRUE, perl=TRUE)
The (?:\W+\w+){0,3}\W+ part matches zero to three occurrences of one or more non-word chars (\W+) and then one or more word chars (\w+), and then one or more non-word chars.
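To check which part of each string the pattern actually consumes, a small sketch along these lines may help. It reuses the strings vector from the question; the variable names pattern, hits, and matched are only illustrative and not part of the original code.

```r
strings <- c(
  "Today is my birthday",
  "Today is not yet my birthday",
  "Today birthday",
  "Today maybe?",
  "Today: birthday"
)

# Same pattern as in the grepl() call above, stored in a variable
# (the name `pattern` is an illustrative choice)
pattern <- "\\bToday(?:\\W+\\w+){0,3}\\W+birthday\\b"

# regexpr() gives the start of the first match per string, or -1 if none
hits <- regexpr(pattern, strings, ignore.case = TRUE, perl = TRUE)

# Logical vector of matches; this should agree with the grepl() result
matched <- as.integer(hits) != -1L
print(matched)

# regmatches() returns the matched text, only for the strings that matched
print(regmatches(strings, hits))
```

Because \W+ accepts any run of non-word characters, the colon and space in "Today: birthday" are consumed together as a single separator, which is why that string now matches.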