
Why is the end of the string not recognised as a word boundary (\\b) in stringr/ICU and Perl?

Using stringr, I tried to detect a symbol at the end of a string as follows:

str_detect("my text €", "€\\b") # FALSE

Why is this not working? It is working in the following cases:

str_detect("my text a", "a\\b") # TRUE - letter instead of €
grepl("€\\b", "2009in €") # TRUE - base R solution

But it also fails in perl mode:

grepl("€\\b", "2009in €", perl=TRUE) # FALSE

So what is wrong about the €\\b-regex? The regex €$ is working in all cases...

Rentrop asked Dec 15 '16 23:12



2 Answers

When you use base R regex functions without perl=TRUE, the TRE regex flavor is used.

It appears that a TRE word boundary:

  • when used after a non-word character, matches at the end-of-string position, and
  • when used before a non-word character, matches at the start-of-string position.

See the R tests:

> gsub("\\b\\)", "HERE", ") 2009in )")
[1] "HERE 2009in )"
> gsub("\\)\\b", "HERE", ") 2009in )")
[1] ") 2009in HERE"

This is not the usual behavior of a word boundary. In the PCRE and ICU regex flavors, a word boundary before a non-word character matches only when that character is preceded by a word character, which excludes the start-of-string position. Likewise, a word boundary after a non-word character requires a word character to follow immediately, which excludes the end-of-string position. The usual definition reads:

There are three different positions that qualify as word boundaries:

- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.

Wiktor Stribiżew answered Oct 11 '22 20:10


\b

is equivalent to

(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))

which is to say it matches

  • between a word char and a non-word char,
  • between a word char and the start of the string, and
  • between a word char and the end of the string.

€ is a symbol, and symbols aren't word characters.

$ uniprops €
U+20AC <€> \N{EURO SIGN}
    \pS \p{Sc}
    All Any Assigned Common Zyyy Currency_Symbol Sc Currency_Symbols S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Print X_POSIX_Print Symbol Unicode
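One way to check the equivalence from R is to compare `\b` with its look-around expansion under `perl=TRUE`. This is a minimal sketch in base R; `wb`, `inputs` and the other names are made up for illustration, and `\u20ac` is the escape for the euro sign.

```r
# Look-around expansion of \b, escaped for an R string literal.
wb <- "(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))"

inputs <- c("my text a", "my text \u20ac", "2009in \u20ac")

# Mark the boundaries found by \b and by the expansion, then compare per input.
same_marks <- vapply(
  inputs,
  function(s) identical(gsub("\\b", "|", s, perl = TRUE),
                        gsub(wb, "|", s, perl = TRUE)),
  logical(1)
)
print(unname(same_marks))

# A trailing letter has a boundary after it; a trailing euro sign does not,
# because neither the euro sign nor the end of the string is a word character.
letter_end <- grepl(paste0("a", wb), "my text a", perl = TRUE)
euro_end   <- grepl(paste0("\u20ac", wb), "my text \u20ac", perl = TRUE)
print(c(letter_end, euro_end))
```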

If your language supports look-behinds and look-aheads, you could use the following to find a boundary between a space and a non-space (treating the start and end of the string as a space).

(?:(?<!\S)(?=\S)|(?<=\S)(?!\S))
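Applied to the strings from the question, that pattern might be used as follows. This is a sketch in base R with `perl=TRUE`; the name `sb` and the second test string are assumptions added for illustration.

```r
# Boundary between space and non-space, with string start/end treated as space.
sb <- "(?:(?<!\\S)(?=\\S)|(?<=\\S)(?!\\S))"

# Pattern: a euro sign (\u20ac) immediately followed by such a boundary.
euro_sb <- paste0("\u20ac", sb)

# The euro sign ends the string, so the boundary after it matches.
at_end <- grepl(euro_sb, "my text \u20ac", perl = TRUE)

# Here the euro sign is followed by a non-space, so there is no boundary.
mid_token <- grepl(euro_sb, "my text \u20acuro", perl = TRUE)

print(c(at_end, mid_token))  # TRUE FALSE
```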

ikegami answered Oct 11 '22 21:10