I'm trying to remove non-alphabet characters from a vector of strings. I thought the [:punct:] grouping would cover it, but it seems to ignore the + character. Does this belong to another group of characters?
library(stringi)
string1 <- c(
"this is a test"
,"this, is also a test"
,"this is the final. test"
,"this is the final + test!"
)
string1 <- stri_replace_all_regex(string1, '[:punct:]', ' ')
string1 <- stri_replace_all_regex(string1, '\\+', ' ')
In POSIX-like regex engines, punct stands for the character class corresponding to the ispunct() classification function (check out man 3 ispunct on UNIX-like systems). According to ISO/IEC 9899:1990 (ISO C90), the ispunct() function tests for any printing character except for space or a character for which isalnum() is true. However, in a POSIX setting, the details of which characters belong to which class depend on the current locale. So the punct class here will not lead to portable code; see the ICU User Guide on C/POSIX Migration for more details.
On the other hand, the ICU library, on which stringi relies, and which fully conforms to the Unicode standard, defines some of the charclasses in its own -- but well-defined and always portable -- way.
In particular, according to the Unicode standard, the PLUS SIGN (U+002B) is in the Symbol, Math (Sm) category (and is not a Punctuation Mark (P)).
library("stringi")
ascii <- stri_enc_fromutf32(1:127)
stri_extract_all_regex(ascii, "[[:punct:]]")[[1]]
## [1] "!" "\"" "#" "%" "&" "'" "(" ")" "*" "," "-" "." "/" ":" ";" "?" "@" "[" "\\" "]" "_" "{" "}"
stri_extract_all_regex(ascii, "[[:symbol:]]")[[1]]
## [1] "$" "+" "<" "=" ">" "^" "`" "|" "~"
So here you should instead use a character set such as [[:punct:][:symbol:]] or [[:punct:]+], or, even better, [\\p{P}\\p{S}] or [\\p{P}+].
For details on available character classes, check out ?"stringi-search-charclass". In particular, the ICU User Guide on UnicodeSet and Unicode Standard Annex #44: Unicode Character Database may be of interest. HTH
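As a quick check of the sets suggested above, the following sketch confirms the category of the plus sign and applies [\\p{P}\\p{S}] to the vector from the question. The name cleaned is only illustrative and does not appear in the original post. The cleaned strings are not printed here, since each removed character leaves a space behind.

```r
library(stringi)

# The vector from the question
string1 <- c(
  "this is a test",
  "this, is also a test",
  "this is the final. test",
  "this is the final + test!"
)

# PLUS SIGN is a Symbol (Sm), not Punctuation (P)
stri_detect_regex("+", c("\\p{P}", "\\p{S}", "\\p{Sm}"))
## [1] FALSE  TRUE  TRUE

# Replace every punctuation mark and symbol with a space
cleaned <- stri_replace_all_regex(string1, "[\\p{P}\\p{S}]", " ")

# No punctuation or symbol characters should remain
any(stri_detect_regex(cleaned, "[\\p{P}\\p{S}]"))
## [1] FALSE
```

If the extra spaces are unwanted, they can be collapsed in a further step.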
POSIX character classes need to be wrapped inside a character class; the correct form would be [[:punct:]]. Do not confuse the POSIX term "character class" with what is normally called a regex character class.
In the ASCII range, this POSIX named class matches all non-control, non-alphanumeric, non-space characters.
ascii <- rawToChar(as.raw(0:127), multiple = TRUE)
paste(ascii[grepl('[[:punct:]]', ascii)], collapse="")
# [1] "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
However, if a locale is in effect, it could alter the behavior of [[:punct:]].
The R documentation ?regex states the following: Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation is that of the POSIX locale.
The Open Group LC_CTYPE definition for punct says:
Define characters to be classified as punctuation characters.
In the POSIX locale, neither the <space> nor any characters in classes alpha, digit, or cntrl shall be included. In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit, or as the <space> shall be specified.
However, the stringi package seems to depend on ICU and locale is a fundamental concept in ICU.
Using the stringi package, I recommend using the Unicode properties \p{P} and \p{S}.
\p{P} matches any kind of punctuation character. However, it is missing nine of the characters that the POSIX class punct includes. This is because Unicode splits what POSIX considers to be punctuation into two categories, Punctuation and Symbols. This is where \p{S} comes into play:
stri_replace_all_regex(string1, '[\\p{P}\\p{S}]', ' ')
# [1] "this is a test"            "this  is also a test"
# [3] "this is the final  test"   "this is the final   test "
Or fall back to gsub from base R, which handles this very well.
gsub('[[:punct:]]', ' ', string1)
# [1] "this is a test"            "this  is also a test"
# [3] "this is the final  test"   "this is the final   test "
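To see which characters the two definitions disagree on, one can compare them over the ASCII range. This is only a minimal sketch: the names posix_punct, unicode_punct, and only_posix are illustrative, and the base R result assumes a locale in which [[:punct:]] behaves as in the POSIX locale.

```r
library(stringi)

ascii <- rawToChar(as.raw(1:127), multiple = TRUE)

# Characters matched by the POSIX class in base R (locale-dependent)
posix_punct <- ascii[grepl("[[:punct:]]", ascii)]

# Characters matched by the Unicode Punctuation category in ICU
unicode_punct <- ascii[stri_detect_regex(ascii, "\\p{P}")]

# Characters that POSIX counts as punctuation but \p{P} does not match
only_posix <- setdiff(posix_punct, unicode_punct)
only_posix

# Each of them should fall under the Unicode Symbol category instead
all(stri_detect_regex(only_posix, "\\p{S}"))
```

In a locale where base R matches the full ASCII punctuation set, only_posix should hold the nine characters mentioned above.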