I was trying to collapse every run of two or more whitespace characters within the elements of a vector into a single space, using gsub(), e.g.:
x1 <- c(" abc", "a b c ", "a b c")
gsub("\\s{2,}", " ", x1)
[1] " abc" "a b c " "a b c"
But as soon as the vector contains NA, the substitution fails:
x2 <- c(NA, " abc", "a b c ", "a b c")
gsub("\\s{2,}", " ", x2)
[1] NA " " " " " "
However, it works fine if one uses Perl-like regular expressions:
gsub("\\s{2,}", " ", x2, perl = TRUE)
[1] NA " abc" "a b c " "a b c"
Does anyone have suggestions as to why R's own regular expressions behave in that way? I'm using R 3.1.1 on Linux x86-64 if that helps.
I haven't poked at the source code, but it also works if you use the useBytes=TRUE parameter (without the perl=TRUE parameter). From the help: "if useBytes is TRUE the matching is done byte-by-byte rather than character-by-character." That may be part of why it's failing in gsub.
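As a minimal sketch of the workaround, the call can be wrapped in a small helper that always uses the Perl-compatible engine. The name collapse_ws is hypothetical, and the input strings are illustrative stand-ins containing runs of spaces, not the exact vector from the question:

```r
# Hypothetical helper: collapse each run of 2+ whitespace characters
# into a single space, using the PCRE engine to avoid the NA problem
# seen with the default engine in R 3.1.1.
collapse_ws <- function(x) {
  gsub("\\s{2,}", " ", x, perl = TRUE)
}

# Illustrative input (assumed, not taken from the question)
x <- c(NA, "  abc", "a  b   c ", "a b c")
result <- collapse_ws(x)
print(result)
```

NA elements pass through unchanged, and elements without a run of two or more whitespace characters are returned as they were.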
However, regexpr, regexec and gregexpr each find all the correct positions (I have substituted \\s with [[:space:]] for readability and only show the output from regexpr):
regexpr("[[:space:]]{2,}", x2)
## [1] NA 1 1 1
## attr(,"match.length")
## [1] NA 5 9 6
So, the regex itself is fine.
Update: a quick glance at do_gsub in R 3.1.1's grep.c didn't yield much insight (it's a twisted maze of if/else statements :-), but I'd almost want to call this a bug.
Just to wrap this question up: as several others suggested, the behaviour is in fact a bug. Reported and confirmed here:
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16009
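To check whether a particular R installation is affected, one rough approach is to compare the default engine against perl = TRUE on a vector that contains NA. The test vector below is an assumed example with runs of spaces, not the original data:

```r
# Assumed test vector: an NA followed by strings with runs of spaces
x2 <- c(NA, "  abc", "a  b   c ", "a b c")

default_result <- gsub("\\s{2,}", " ", x2)               # default engine
perl_result    <- gsub("\\s{2,}", " ", x2, perl = TRUE)  # PCRE engine

# If the two results differ, the installation likely shows the bug
affected <- !identical(default_result, perl_result)
print(affected)
```

If affected is TRUE, passing perl = TRUE or useBytes = TRUE to gsub is the workaround described above.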