Why is the end of the string (\\b) not recognised as a word boundary in stringr/ICU and Perl?

Tags:

regex

r

pcre

stringr

Using stringr, I tried to detect a € sign at the end of a string as follows:

str_detect("my text €", "€\\b") # FALSE

Why is this not working? It is working in the following cases:

str_detect("my text a", "a\\b") # TRUE - letter instead of €
grepl("€\\b", "2009in €") # TRUE - base R solution

But it also fails in perl mode:

grepl("€\\b", "2009in €", perl=TRUE) # FALSE

So what is wrong with the €\\b regex? The regex €$ works in all cases.


asked Dec 15 '16 23:12

Rentrop

2 Answers

When you use base R regex functions without perl=TRUE, the TRE regex flavor is used.

It appears that a TRE word boundary:

- matches the end-of-string position when used after a non-word character, and
- matches the start-of-string position when used before a non-word character.

See the R tests:

> gsub("\\b\\)", "HERE", ") 2009in )")
[1] "HERE 2009in )"
> gsub("\\)\\b", "HERE", ") 2009in )")
[1] ") 2009in HERE"

This differs from the PCRE and ICU regex flavors. There, a word boundary before a non-word character matches only if that character is preceded by a word character (so never at the start of the string), and a word boundary after a non-word character matches only if a word character follows immediately (so never at the end of the string):

There are three different positions that qualify as word boundaries:

- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
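
The three positions listed above can be checked in base R with perl=TRUE (PCRE). This is only a minimal sketch: the test strings and the `checks` vector are made up for illustration and are not part of the original question.

```r
# PCRE word-boundary checks (perl = TRUE); all test strings are illustrative.
checks <- c(
  # 1. Start of string: a boundary only if the first character is a word character.
  start_word    = grepl("^\\b", "word",  perl = TRUE),
  start_nonword = grepl("^\\b", ")word", perl = TRUE),
  # 2. End of string: a boundary only if the last character is a word character.
  end_word      = grepl("\\b$", "word",  perl = TRUE),
  end_nonword   = grepl("\\b$", "word)", perl = TRUE),
  # 3. Inside the string: a boundary only between a word and a non-word character.
  between       = grepl("d\\b\\)", "word)", perl = TRUE),
  after_nonword = grepl("\\)\\b",  "word)", perl = TRUE)
)
print(checks)
```

The `end_nonword` and `after_nonword` cases correspond to the € situation in the question: the string ends in a non-word character, so PCRE finds no word boundary after it.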


answered Oct 11 '22 20:10

Wiktor Stribiżew

\b

is equivalent to

(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))

which is to say it matches

- between a word char and a non-word char,
- between a word char and the start of the string, and
- between a word char and the end of the string.
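
As a rough check of this equivalence, the sketch below compares \b with the lookaround expansion on a handful of inputs in base R with perl=TRUE. The variable names (`wb`, `inputs`, `with_b`, `with_look`) and the sample strings are assumptions chosen for illustration.

```r
# Lookaround expansion of \b, as given above (backslashes doubled for R strings).
wb <- "(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))"

# Illustrative inputs: "a" at the end, alone, before a letter, next to ")", absent.
inputs <- c("my text a", "a", "ab", "a)", ")a", "my text )")

with_b    <- grepl("a\\b",          inputs, perl = TRUE)
with_look <- grepl(paste0("a", wb), inputs, perl = TRUE)

# Both patterns should agree on every input.
print(data.frame(inputs, with_b, with_look))
print(identical(with_b, with_look))
```

Agreement on a few strings is of course not a proof, but it makes the three cases in the list easy to see: only "ab" (word char followed by a word char) and the input without an "a" fail to match.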

€ is a symbol, and symbols aren't word characters.

$ uniprops €
U+20AC <€> \N{EURO SIGN}
    \pS \p{Sc}
    All Any Assigned Common Zyyy Currency_Symbol Sc Currency_Symbols S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Print X_POSIX_Print Symbol Unicode
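
The same classification can be checked from R itself. The sketch below assumes that R's PCRE build supports Unicode properties (\p{...}), which is normally the case; the euro sign is written as "\u20ac" so the example does not depend on the source file encoding. The variable names are illustrative.

```r
euro <- "\u20ac"  # EURO SIGN, U+20AC

# Not a word character, so \b cannot match between it and the end of the string.
is_word     <- grepl("\\w", euro, perl = TRUE)       # FALSE
# It is a currency symbol (general category Sc).
is_currency <- grepl("\\p{Sc}", euro, perl = TRUE)   # TRUE

# Consequence for the pattern from the question:
at_end      <- grepl("\u20ac\\b", "my text \u20ac", perl = TRUE)  # FALSE: nothing follows
before_word <- grepl("\u20ac\\b", "\u20acuro", perl = TRUE)       # TRUE: a letter follows

print(c(is_word = is_word, is_currency = is_currency,
        at_end = at_end, before_word = before_word))
```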

If your language supports look-behinds and look-aheads, you could use the following to find a boundary between a space and non-space (treating the start and end as a space).

(?:(?<!\S)(?=\S)|(?<=\S)(?!\S))
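
Applied to the question, this could look like the following sketch in base R with perl=TRUE. The name `space_boundary` and the sample strings are assumptions made for illustration.

```r
# Boundary between space and non-space, treating start and end of string as a space.
space_boundary <- "(?:(?<!\\S)(?=\\S)|(?<=\\S)(?!\\S))"
pattern <- paste0("\u20ac", space_boundary)  # "\u20ac" is the euro sign

at_end       <- grepl(pattern, "my text \u20ac", perl = TRUE)   # euro at end of string
before_space <- grepl(pattern, "5 \u20ac today", perl = TRUE)   # euro followed by a space
before_word  <- grepl(pattern, "\u20acuro", perl = TRUE)        # euro followed by a letter

print(c(at_end = at_end, before_space = before_space, before_word = before_word))
```

Unlike €\\b, this matches when € ends the string or is followed by whitespace, and does not match when another non-space character follows. The same pattern should also work with str_detect, since ICU supports these lookarounds, though that is not verified here.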

answered Oct 11 '22 21:10

ikegami
