">" is not matched by "[[:punct:]]" when using `stringr::str_replace_all`? [duplicate]

Tags: regex, r, stringr

I find this really odd:

pattern <- "[[:punct:][:digit:][:space:]]+"
string  <- "a . , > 1 b"

gsub(pattern, " ", string)
# [1] "a b"

library(stringr)
str_replace_all(string, pattern, " ")
# [1] "a > b"

str_replace_all(string, "[[:punct:][:digit:][:space:]>]+", " ")
# [1] "a b"

Is this expected?

Moody_Mudskipper asked Nov 02 '18 13:11



1 Answer

Still working on this, but ?"stringi-search-charclass" says:

Beware of using POSIX character classes, e.g. ‘[:punct:]’. ICU User Guide (see below) states that in general they are not well-defined, so may end up with something different than you expect.

In particular, in POSIX-like regex engines, ‘[:punct:]’ stands for the character class corresponding to the ‘ispunct()’ classification function (check out ‘man 3 ispunct’ on UNIX-like systems). According to ISO/IEC 9899:1990 (ISO C90), the ‘ispunct()’ function tests for any printing character except for space or a character for which ‘isalnum()’ is true. However, in a POSIX setting, the details of what characters belong into which class depend on the current locale. So the ‘[:punct:]’ class does not lead to portable code (again, in POSIX-like regex engines).

So a POSIX flavor of ‘[:punct:]’ is more like ‘[\p{P}\p{S}]’ in ‘ICU’. You have been warned.
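
To see which characters are affected, one can check each ASCII punctuation character against both classes. This is only a sketch, assuming stringr is installed and backed by ICU as described above; the variable names (ascii_punct, missed) are made up for illustration.

```r
library(stringr)

# The 32 ASCII characters that POSIX ispunct() accepts in the C locale.
ascii_punct <- strsplit("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~", "")[[1]]

# Under ICU, [[:punct:]] behaves like \p{P}, so symbol characters fall through.
is_icu_punct <- str_detect(ascii_punct, "[[:punct:]]")
missed <- ascii_punct[!is_icu_punct]
print(missed)

# The characters missed above should all belong to the Unicode symbol class \p{S}.
print(all(str_detect(missed, "\\p{S}")))
```

Given the behaviour reported in the question, ">" should be among the characters printed by the first call.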

Copying from the issue posted above,

string  <- "a . , > 1 b"
# ICU counterpart of POSIX [:punct:]: punctuation (\p{P}) plus symbols (\p{S})
mypunct <- "[[\\p{P}][\\p{S}]]"
stringr::str_remove_all(string, mypunct)
# [1] "a    1 b"
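
The same idea can probably be applied to the pattern from the question by adding \p{S} next to the POSIX classes. A minimal sketch, assuming an ICU-backed stringr; the name result is only for illustration:

```r
library(stringr)

string  <- "a . , > 1 b"
# \p{S} is added so that symbols such as ">" are covered as well as \p{P}.
pattern <- "[[:punct:]\\p{S}[:digit:][:space:]]+"

result <- str_replace_all(string, pattern, " ")
print(result)
```

With this pattern the whole run between "a" and "b" is replaced by a single space, which matches the gsub() result shown in the question.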

I can appreciate stuff being locale-specific, but it still surprises me that [:punct:] doesn't even work in a C locale ...

Ben Bolker answered Nov 14 '22 20:11
