
R regex: issues with character vectors containing NAs

Tags: regex, r

I was trying to collapse each run of two or more whitespace characters within the elements of a character vector into a single space, using gsub(), e.g.:

x1 <- c("  abc", "a b c    ", "a  b c")
gsub("\\s{2,}", " ", x1)
[1] " abc"   "a b c " "a b c"

But as soon as the vector contains NA the substitution fails:

x2 <- c(NA, "  abc", "a b c    ", "a  b c")
gsub("\\s{2,}", " ", x2)
[1] NA  " " " " " "

However, it works fine if one uses Perl-like regular expressions:

gsub("\\s{2,}", " ", x2, perl = TRUE)
[1] NA       " abc"   "a b c " "a b c"

Does anyone know why R's default regular expression engine behaves this way? I'm using R 3.1.1 on Linux x86-64, if that helps.

Roland Seubert, asked Oct 03 '14 06:10


2 Answers

I haven't poked at the source code, but it also works if you pass useBytes = TRUE (without perl = TRUE). From the help: "if useBytes is TRUE the matching is done byte-by-byte rather than character-by-character." That may be part of why the default call to gsub fails.
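For reference, the three call variants mentioned so far can be compared side by side. This is only a minimal sketch: the `expected` vector is written out by hand for illustration, and the variable names are not from the original post. On R 3.1.1 the default call is the one reported to fail; on a version without the bug, all three entries should be TRUE.

```r
x2 <- c(NA, "  abc", "a b c    ", "a  b c")

# Hand-written expected result (an assumption for illustration):
# NA stays NA, every run of 2+ whitespace characters becomes one space.
expected <- c(NA, " abc", "a b c ", "a b c")

res_default <- gsub("\\s{2,}", " ", x2)                   # default engine
res_bytes   <- gsub("\\s{2,}", " ", x2, useBytes = TRUE)  # byte-wise matching
res_perl    <- gsub("\\s{2,}", " ", x2, perl = TRUE)      # PCRE

# TRUE means the variant produced the expected vector
checks <- c(
  default  = identical(res_default, expected),
  useBytes = identical(res_bytes, expected),
  perl     = identical(res_perl, expected)
)
print(checks)
```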

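However, regexpr, regexec and gregexpr each report a match for every non-NA element (I have substituted [[:space:]] for \\s for readability and show only the output from regexpr):

regexpr("[[:space:]]{2,}", x2)

## [1] NA  1  1  1
## attr(,"match.length")
## [1] NA  5  9  6

So the regex itself is accepted. Note, though, that the reported match lengths (5, 9 and 6) are the full lengths of the three strings, so each match spans the whole element rather than just the run of whitespace. That is consistent with gsub replacing every non-NA element with a single " ".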
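A minimal sketch for checking whether a reported match covers only the whitespace run or the whole element: it compares each match length with nchar() of the corresponding string. The variable names (`start`, `len`, `whole`) are made up for illustration. Since every element of x2 contains non-whitespace characters, a match that starts at 1 and is as long as the string should not occur with a correctly working engine.

```r
x2 <- c(NA, "  abc", "a b c    ", "a  b c")

m     <- regexpr("[[:space:]]{2,}", x2)
start <- as.integer(m)                           # drop attributes, keep positions
len   <- as.integer(attr(m, "match.length"))

# TRUE where the match starts at position 1 and spans the entire element
whole <- !is.na(start) & start == 1L & len == nchar(x2)

print(data.frame(x = x2, start = start, length = len, whole_element = whole))
```

On an R version without the bug, one would expect starts of 1, 6 and 2 with lengths 2, 4 and 2 for the three non-NA elements, and `whole_element` to be FALSE throughout.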

Update: a quick glance at do_gsub in R 3.1.1's grep.c didn't yield much insight (it's a twisted maze of if/else statements :-), but I'd almost want to call this a bug.

hrbrmstr, answered Oct 20 '22 18:10


Just to wrap this question up: as several others suggested, the behaviour is in fact a bug. Reported and confirmed here:

https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16009
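On an affected R version, one possible workaround besides perl = TRUE or useBytes = TRUE is to apply gsub only to the non-NA elements, since the question shows the default engine working on a vector without NA. The helper below, `collapse_ws`, is a hypothetical name and not part of base R; treat it as a sketch rather than a tested fix for the bug.

```r
# Hypothetical helper (not part of base R): collapse runs of two or more
# whitespace characters, passing only the non-NA elements to gsub().
collapse_ws <- function(x) {
  ok <- !is.na(x)                        # positions that are safe to process
  x[ok] <- gsub("\\s{2,}", " ", x[ok])   # NA elements are left untouched
  x
}

x2 <- c(NA, "  abc", "a b c    ", "a  b c")
print(collapse_ws(x2))
```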

Roland Seubert, answered Oct 20 '22 17:10