I have a very special question concerning regular expressions in R:
grepl("(|^)over","stackoverflow")
# [1] TRUE
grepl("(^|)over","stackoverflow")
# [1] FALSE
grepl("(^|x|)over","stackoverflow")
# [1] FALSE
grepl("(x|^|)over","stackoverflow")
# [1] FALSE
grepl("(x||^)over","stackoverflow")
# [1] TRUE
Why don't all of these expressions evaluate to TRUE?
^.*$ means: match, from the beginning to the end of the string, any character zero or more times. In other words, it matches the entire string, so the pattern is not very useful on its own. Let's take a regex pattern that may be a bit more useful.
The Difference Between \s and \s+
The plus sign + is a greedy quantifier that means "one or more times". For example, the expression X+ matches one or more X characters. Therefore, the regular expression \s matches a single whitespace character, while \s+ matches one or more whitespace characters.
\s stands for “whitespace character”. Exactly which characters this includes depends on the regex flavor. In all flavors discussed in this tutorial, it includes [ \t\r\n\f]. That is: \s matches a space, a tab, a carriage return, a line feed, or a form feed.
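A minimal sketch of the \s versus \s+ difference in R. The sample string is made up for illustration, and perl = TRUE is used here only so the behavior comes from a single, well-known engine:

```r
# Hypothetical sample: "a", two spaces, "b", one tab, "c"
x <- "a  b\tc"

# \s matches one whitespace character at a time,
# so every whitespace character gets its own replacement
single <- gsub("\\s", "_", x, perl = TRUE)

# \s+ matches a whole run of whitespace,
# so each run is replaced exactly once
runs <- gsub("\\s+", "_", x, perl = TRUE)

print(single)  # "a__b_c"
print(runs)    # "a_b_c"
```

Note that the backslash has to be doubled inside an R string literal, which is why the pattern is written as "\\s".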
Under POSIX regular expression semantics, all of these should actually return TRUE. It appears that R uses a slightly modified version of Ville Laurikari's TRE library, which doesn't quite follow the standard here. I'd follow @rawr's recommendation and use perl = TRUE for more compliant regular expressions.
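As a rough sketch of that recommendation, the five patterns from the question can be re-run with perl = TRUE. The variable names below are illustrative only. With the PCRE engine an empty alternative simply matches the empty string, so each pattern should find "over" in the middle of the string:

```r
# The five patterns from the question
patterns <- c("(|^)over", "(^|)over", "(^|x|)over", "(x|^|)over", "(x||^)over")

# grepl() is vectorised over the input, not the pattern,
# so apply each pattern separately
pcre_results <- vapply(
  patterns,
  function(p) grepl(p, "stackoverflow", perl = TRUE),
  logical(1)
)

print(pcre_results)
```

Running the same loop without perl = TRUE uses the default TRE engine, which is where the inconsistent results reported in the question come from.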
See also: When both halves of an OR regex group match, is it defined which will be chosen?