I'm puzzled by the output of the following three tests.
This one includes a special character, "°", and gives the expected result:
sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = "A°C 01160")
[1] "01160"
This one includes a quote and also gives the expected result:
sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = "01160 'aa")
[1] "01160"
But this one includes both "°" and a quote, and returns a strange result:
sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = "A°C 01160 'aa")
[1] "0 'aa"
I'm also puzzled that the result is different when I pass the same inputs as a vector:
sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = c("A°C 01160", "01160 'aa", "A°C 01160 'aa"))
[1] "01160" "0 'aa" "0 'aa"
Does anyone have a clue about the origin of this problem?
I run R 3.0.2 on Mac OS 10.8 with French UTF-8 encoding options:
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.0.2
To be able to use special characters within a function such as gsub, we have to add two backslashes (i.e. \\) in front of the special character. …the next R syntax replaces the question mark… Looks good! We can use the previous type of R code for basically any special character.
To match a character that has a special meaning in a regex, you need to use an escape sequence prefixed with a backslash (\). E.g., \. matches "."; \+ matches "+"; and \( matches "(". You also need to use \\ to match "\" (backslash).
Placing r or R before a string literal creates what is known as a raw-string literal. Raw strings do not process escape sequences ( \n , \b , etc.) and are thus commonly used for Regex patterns, which often contain a lot of \ characters.
The \D metacharacter matches non-digit characters.
To use special characters in a regular expression the simplest method is usually to escape them with a backslash, but as noted above, the backslash itself needs to be escaped.
grepl("\\[", "a[b")
## [1] TRUE
To match backslashes, you need to double escape, resulting in four backslashes.
In the regular expression above, each "\\d" means a digit, and "." can match anything in between (look at the number 1 in the list of expressions in the beginning). So we get the digits, then a special character in between, three more digits, then a special character again, then four more digits. Anything that matches these criteria was extracted.
A dot is a special character in regular expressions. It is also known as the wildcard character, i.e. it is used to match any character other than \n (newline). Now let us try to escape it using the double backslash (\\).
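The escaping rules quoted above can be sketched in a few lines of R. This is only an illustration: the sample strings and variable names are made up for the example, and fixed = TRUE is shown as an alternative to escaping when no regex features are needed.

```r
# A regex metacharacter is escaped with a backslash, and that backslash
# must itself be doubled inside an R string literal.
has_bracket <- grepl("\\[", "a[b")            # TRUE: matches a literal "["
has_dot     <- grepl("\\.", c("a.b", "ab"))   # TRUE FALSE: literal ".", not a wildcard

# A literal backslash needs four backslashes in the pattern.
# "a\\b" is the three-character string: a, backslash, b.
has_backslash <- grepl("\\\\", "a\\b")        # TRUE

# Replacing a question mark, once with escaping and once with fixed = TRUE,
# which treats the pattern as a plain string.
escaped <- gsub("\\?", "!", "Really?")              # "Really!"
literal <- gsub("?", "!", "Really?", fixed = TRUE)  # "Really!"

# With fixed = TRUE the dollar sign is no longer an end-of-string anchor.
dollars <- sub("$", "USD ", "$100", fixed = TRUE)   # "USD 100"
```

With fixed = TRUE there is nothing to escape, at the cost of losing character classes, anchors and capture groups.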
The interpretation of named character classes such as [:digit:] depends on the locale in question. They can encompass non-ASCII characters. [[:digit:]] would match any character in the Unicode Nd category. If you want to match only ASCII decimal digits, use [0-9].
> sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = "A°C 01160 'aa")
[1] "0 'aa"
> sub(pattern = ".*([0-9]{5}).*", replacement = "\\1", x = "A°C 01160 'aa")
[1] "01160"
>
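As a rough sketch of the [0-9] approach applied to the whole vector from the question, the snippet below extracts the five-digit code in two ways. The variable names are illustrative, and "\u00b0" is just an escape for "°" so that the script does not depend on the encoding of the source file.

```r
# The three inputs from the question; "\u00b0" is the degree sign.
x <- c("A\u00b0C 01160", "01160 'aa", "A\u00b0C 01160 'aa")

# Variant 1: same sub() call as above, with the explicit ASCII range [0-9]
# in place of the locale-dependent class [[:digit:]].
codes_sub <- sub(pattern = ".*([0-9]{5}).*", replacement = "\\1", x = x)

# Variant 2: locate the first run of five ASCII digits and extract it,
# without having to match the surrounding text at all.
m <- regexpr("[0-9]{5}", x)
codes_match <- regmatches(x, m)

print(codes_sub)
print(codes_match)
```

Note that regmatches() silently drops elements that contain no match, whereas sub() returns such elements unchanged, so the two variants can differ in length on messier input.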
Moreover, your observation isn't really specific to R. Quoting from the regex documentation:
Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation below is that of the POSIX locale.
EDIT: Demo of what has been mentioned above:
> Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
> sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = "A°C 01160 'aa")
[1] "0 'aa"
> Sys.setlocale("LC_ALL", "C")
[1] "LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
> sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1", x = "A°C 01160 'aa")
[1] "01160"
>
To elaborate on the demo: the same substitution returned different results under different locales. The result was as expected after switching to the C locale.
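If you want to repeat this kind of check without leaving the session in the C locale, one option is to switch only LC_CTYPE for the duration of a single call and restore it afterwards. The helper below, with_c_ctype, is a hypothetical name invented for this sketch and is not part of base R; it relies only on Sys.getlocale() and Sys.setlocale(). Changing the locale in a running session can have side effects, so treat this as a diagnostic aid rather than a fix.

```r
# Hypothetical helper: evaluate an expression with LC_CTYPE set to "C",
# then restore the previous setting even if the expression fails.
with_c_ctype <- function(expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", "C")
  force(expr)  # expr is a promise, so it is evaluated here, under "C"
}

before <- Sys.getlocale("LC_CTYPE")

# Same pattern as in the demo above, evaluated under the C locale only.
res <- with_c_ctype(
  sub(pattern = ".*([[:digit:]]{5}).*", replacement = "\\1",
      x = "A\u00b0C 01160 'aa")
)
print(res)

after <- Sys.getlocale("LC_CTYPE")  # same value as `before`
```

For production code, replacing [[:digit:]] with [0-9] is the simpler choice, since it does not depend on the locale at all.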