Unexpected unsymmetrical regular expression behavior of \< and \> in R

Tags: regex, r

Let me use the following example to illustrate.

str = "we are friends"

The help doc says that

The symbols \< and \> match the empty string at the beginning and end of a word.

So the following behaves as expected: a space is added at the end of each word.

gsub("\\>"," ", str)
[1] "we  are  friends "

However, why doesn't it work the same way when using

gsub("\\<"," ", str)
[1] " w e  a r e  f r i e n d s"

Can someone explain why this happens? And what do I need to do if I want an extra space added in front of every word?

wen asked Nov 10 '22 07:11


1 Answer

It is pretty strange, but I think this is documented as a warning on the gsub help page:

POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).

So, use \\b(?=\\w) or (?<!\\w)\\b with perl = TRUE. Both match only the word boundary at the start of a word:

str = "we are friends"
gsub('(?<!\\w)\\b', ' ', str, perl = TRUE)


Output: [1] " we  are  friends"
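For comparison, here is a minimal sketch that runs both suggested patterns side by side. The third variant is not from the documentation quoted above: it is an assumed alternative that sidesteps zero-width matches entirely by capturing each word and writing it back with a leading space. The variable names (start1, start2, start3) are only illustrative.

```r
str <- "we are friends"

# Lookahead form: a word boundary that is followed by a word character,
# i.e. the position just before each word.
start1 <- gsub("\\b(?=\\w)", " ", str, perl = TRUE)

# Lookbehind form: a word boundary that is not preceded by a word character.
start2 <- gsub("(?<!\\w)\\b", " ", str, perl = TRUE)

# Assumed alternative: capture each run of word characters and re-emit it
# with a leading space. No zero-width match is involved, so the default
# (non-PCRE) engine can be used.
start3 <- gsub("(\\w+)", " \\1", str)

print(start1)
# [1] " we  are  friends"

# All three approaches should agree on this input.
stopifnot(identical(start1, start2), identical(start1, start3))
```

The capture-group version may be the simpler choice when perl = TRUE is not wanted, although what counts as a word character can still depend on the locale for non-ASCII input.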

Wiktor Stribiżew answered Nov 15 '22 05:11