
How does R deal with special characters in regular expressions?

Tags: regex, r

I'm puzzled by the output of the following three tests:

This one includes the special character "°" and gives the expected result:

sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = "A°C 01160")
[1] "01160"

This one includes a quote and also gives the expected result:

sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = "01160 'aa")
[1] "01160"

But this one includes both "°" and a quote, and returns a weird result:

sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = "A°C 01160 'aa")
[1] "0 'aa"

By the way, I'm also puzzled that the result is not the same when I pass the same inputs as a vector:

sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = c("A°C 01160", "01160 'aa", "A°C 01160 'aa"))
[1] "01160" "0 'aa" "0 'aa"

Does anyone have a clue about the origin of my problem?

I run R 3.0.2 on Mac OS X 10.8 with French UTF-8 locale settings:

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.0.2
PAC asked May 12 '14 13:05


1 Answer

The interpretation of named character classes such as [:digit:] depends on the locale in use. They can encompass non-ASCII characters.

[[:digit:]] would match any character in the Unicode Nd category.

If you want to match only ASCII decimal digits, use [0-9].

> sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = "A°C 01160 'aa")
[1] "0 'aa"
> sub(pattern = ".*([0-9]{5}).*", replacement = "\\1", x = "A°C 01160 'aa")
[1] "01160"
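
The same idea can be wrapped in a small function and applied to the whole vector from the question. This is only a sketch: `extract_code` is a hypothetical name, and the probe with a non-ASCII digit (U+0663) is an added illustration, not part of the original answer. The degree sign is written as the escape "\u00b0" so the snippet does not depend on the encoding of the source file.

```r
# Hypothetical helper (not from the original answer): return the last run of
# five ASCII digits in each element of x; elements without a match come back
# unchanged, as with any sub() call.
extract_code <- function(x) {
  # [0-9] is an explicit ASCII range, so its meaning does not depend on LC_CTYPE.
  sub(pattern = ".*([0-9]{5}).*", replacement = "\\1", x = x)
}

# The three inputs from the question, with the degree sign as an escape.
inputs <- c("A\u00b0C 01160", "01160 'aa", "A\u00b0C 01160 'aa")
codes <- extract_code(inputs)
print(codes)

# Probe: does a non-ASCII decimal digit (ARABIC-INDIC DIGIT THREE, U+0663)
# match each class? The [0-9] range never matches it; the [[:digit:]] result
# may vary with the locale and the R build, so no output is claimed for it.
arabic_three <- "\u0663"
ascii_range <- grepl("[0-9]", arabic_three)
posix_class <- grepl("[[:digit:]]", arabic_three)
print(c(ascii_range = ascii_range, posix_class = posix_class))
```

Comparing the two values printed by the probe on your own machine is a quick way to see whether [[:digit:]] is wider than [0-9] in your locale.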

Moreover, your observation isn't really specific to R. Quoting from the regex documentation:

Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation below is that of the POSIX locale.


EDIT: Demo of what has been mentioned above:

> Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
> sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = "A°C 01160 'aa")
[1] "0 'aa"
> Sys.setlocale("LC_ALL", "C") 
[1] "LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
> sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = "A°C 01160 'aa")
[1] "01160"

To elaborate on the demo: the same substitution returned different results under different locales. The result was as expected after switching to the C locale.
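
Calling Sys.setlocale("LC_ALL", "C") as in the demo changes the locale for the rest of the session. If you only want the C locale for one call, a narrower variation is to switch LC_CTYPE alone and restore it afterwards. The sketch below assumes that approach; `with_c_ctype` is a hypothetical helper name, not an existing R function, and whether the result actually differs between locales depends on your R version and platform.

```r
# Hypothetical helper: evaluate an expression with LC_CTYPE temporarily set
# to "C", then restore the previous setting even if the expression fails.
with_c_ctype <- function(expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", "C")
  # expr is a lazily evaluated promise, so it is only computed here,
  # after the locale switch has taken effect.
  force(expr)
}

before <- Sys.getlocale("LC_CTYPE")
result <- with_c_ctype(
  sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1",
      x = "A\u00b0C 01160 'aa")
)
after <- Sys.getlocale("LC_CTYPE")

print(result)
print(identical(before, after))  # the original LC_CTYPE is back in place
```

Compared with changing LC_ALL globally, this keeps collation, messages and number formatting untouched. Using [0-9] in the pattern remains the simpler fix when only ASCII digits are wanted.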

devnull answered Nov 14 '22 23:11