Using stringr, I tried to detect a € sign at the end of a string as follows:
str_detect("my text €", "€\\b") # FALSE
Why is this not working? It is working in the following cases:
str_detect("my text a", "a\\b") # TRUE - letter instead of €
grepl("€\\b", "2009in €") # TRUE - base R solution
But it also fails in perl mode:
grepl("€\\b", "2009in €", perl=TRUE) # FALSE
So what is wrong with the €\\b regex? The regex €$ works in all cases...
When the regexp engine (the program module that implements searching for regexps) comes across \b, it checks that the position in the string is a word boundary. There are three different positions that qualify as word boundaries:
- At string start, if the first string character is a word character \w.
- Between two characters in the string, where one is a word character \w and the other is not.
- At string end, if the last string character is a word character \w.
For instance, the regexp \bJava\b will be found in Hello, Java!, where Java is a standalone word, but not in Hello, JavaScript!. In the string Hello, Java!, the following positions correspond to \b: before H, after o, before J, and after a.
\B matches at every position that is not a word boundary. Thus \B' matches at " '" (space-quote) because neither the space nor the quote is a word character. It does not match at e' because the position between e and ' is a word boundary: e is a word character and ' isn't. (From an answer by John1024, Sep 20 '15.)
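The \bJava\b and \B' cases can be checked directly in R. This is only a minimal sketch using base R with perl = TRUE (PCRE); the variable names are illustrative and not part of any answer quoted here.

```r
# Illustrative inputs for the \bJava\b case.
greetings <- c("Hello, Java!", "Hello, JavaScript!")

# \bJava\b needs a word boundary on both sides of "Java".
java_hits <- grepl("\\bJava\\b", greetings, perl = TRUE)
print(java_hits)  # TRUE FALSE

# \B' needs a non-boundary position right before the quote.
quote_after_space  <- grepl("\\B'", " 'x", perl = TRUE)  # space then quote: TRUE
quote_after_letter <- grepl("\\B'", "e'", perl = TRUE)   # letter then quote: FALSE
print(c(quote_after_space, quote_after_letter))
```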
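When you use base R regex functions without perl=TRUE, the TRE regex flavor is used.
It appears that a TRE word boundary also matches at the start of the string when it is used before a non-word character, and at the end of the string when it is used after a non-word character.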
See the R tests:
> gsub("\\b\\)", "HERE", ") 2009in )")
[1] "HERE 2009in )"
> gsub("\\)\\b", "HERE", ") 2009in )")
[1] ") 2009in HERE"
>
This is not how a word boundary commonly behaves. In the PCRE and ICU regex flavors, a word boundary before a non-word character matches only when that character is preceded by a word character, which excludes the start-of-string position; likewise, a word boundary after a non-word character requires a word character to appear right after it:
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
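To see the two engines side by side, the same substitutions can be rerun with perl = TRUE. This is a minimal sketch: the TRE results are the ones quoted in the R tests above, so only the PCRE results are annotated here.

```r
x <- ") 2009in )"

# Default engine (TRE): compare with the outputs quoted in the R tests.
tre_before <- gsub("\\b\\)", "HERE", x)
tre_after  <- gsub("\\)\\b", "HERE", x)

# PCRE: neither ")" is adjacent to a word character, so \b never matches
# next to it and the string comes back unchanged.
pcre_before <- gsub("\\b\\)", "HERE", x, perl = TRUE)
pcre_after  <- gsub("\\)\\b", "HERE", x, perl = TRUE)

print(c(tre_before, tre_after))
print(c(pcre_before, pcre_after))  # both ") 2009in )"
```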
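\b is equivalent to
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))
which is to say it matches between a word character and a non-word character, or between a word character and the start or end of the string.
€ is a symbol, and symbols aren't word characters.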
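The equivalence can be tried out in R with perl = TRUE. A minimal sketch: word_boundary is a hypothetical variable name holding the lookaround form, and "\u20ac" is the escape for the € sign.

```r
# Hypothetical name for the lookaround spelling of \b.
word_boundary <- "(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))"
euro <- "\u20ac"  # the euro sign

# A letter at the end of the string: both spellings match.
letter_short <- grepl("a\\b", "my text a", perl = TRUE)                      # TRUE
letter_long  <- grepl(paste0("a", word_boundary), "my text a", perl = TRUE)  # TRUE

# Euro sign at the end of the string: the sign is not a word character and
# nothing follows it, so neither spelling matches.
euro_text  <- paste0("my text ", euro)
euro_short <- grepl(paste0(euro, "\\b"), euro_text, perl = TRUE)             # FALSE
euro_long  <- grepl(paste0(euro, word_boundary), euro_text, perl = TRUE)     # FALSE

print(c(letter_short, letter_long, euro_short, euro_long))
```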
$ uniprops €
U+20AC <€> \N{EURO SIGN}
\pS \p{Sc}
All Any Assigned Common Zyyy Currency_Symbol Sc Currency_Symbols S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Print X_POSIX_Print Symbol Unicode
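The same classification can be checked from R. This is a minimal sketch, assuming R's PCRE build includes Unicode property support (needed for \p{Sc}); the variable names are illustrative.

```r
euro <- "\u20ac"  # the euro sign, U+20AC

# Not a word character, but a member of the Currency_Symbol (Sc) category.
is_word_char <- grepl("\\w", euro, perl = TRUE)      # FALSE
is_currency  <- grepl("\\p{Sc}", euro, perl = TRUE)  # TRUE
print(c(is_word_char, is_currency))
```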
If your language supports look-behinds and look-aheads, you could use the following to find a boundary between a space and non-space (treating the start and end as a space).
(?:(?<!\S)(?=\S)|(?<=\S)(?!\S))
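Applied to the original problem in R, this could look as follows. A minimal sketch with perl = TRUE; space_boundary is a hypothetical variable name, and the test strings are made up for illustration.

```r
# Hypothetical name for the whitespace-based boundary.
space_boundary <- "(?:(?<!\\S)(?=\\S)|(?<=\\S)(?!\\S))"
euro <- "\u20ac"  # the euro sign
pattern <- paste0(euro, space_boundary)

# Euro sign followed by the end of the string or by a space: match.
at_end       <- grepl(pattern, paste0("my text ", euro), perl = TRUE)         # TRUE
before_space <- grepl(pattern, paste0("2009in ", euro, " net"), perl = TRUE)  # TRUE

# Euro sign glued to further non-space text: no match.
glued <- grepl(pattern, paste0("2009in ", euro, "uro"), perl = TRUE)          # FALSE

print(c(at_end, before_space, glued))
```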