
How do I deal with special characters like \^$.?*|+()[{ in my regex?

Tags:

regex

r

r-faq

I want to match a regular expression special character, \^$.?*|+()[{. I tried:

x <- "a[b"
grepl("[", x)
## Error: invalid regular expression '[', reason 'Missing ']''

(Equivalently stringr::str_detect(x, "[") or stringi::stri_detect_regex(x, "[").)

Doubling the character to escape it doesn't work:

grepl("[[", x)
## Error: invalid regular expression '[[', reason 'Missing ']''

Neither does using a backslash:

grepl("\[", x)
## Error: '\[' is an unrecognized escape in character string starting ""\["

How do I match special characters?


Some special cases of this question are old and well written enough that it would be cheeky to close them as duplicates of this one:
Escaped Periods In R Regular Expressions
How to escape a question mark in R?
escaping pipe ("|") in a regex

Asked by Richie Cotton, Dec 31 '14 12:12



1 Answer

Escape with a double backslash

R treats backslashes as escape characters in character constants, and so do regular expressions. That is why a pattern supplied as a character argument needs two backslashes: the first one isn't actually a character in the string, it only makes the second one into a literal backslash. You can see how escapes are processed using cat.

y <- "double quote: \", tab: \t, newline: \n, unicode point: \u20AC"
print(y)
## [1] "double quote: \", tab: \t, newline: \n, unicode point: €"
cat(y)
## double quote: ", tab:    , newline: 
## , unicode point: €

Further reading: Escaping a backslash with a backslash in R produces 2 backslashes in a string, not 1

To use special characters in a regular expression the simplest method is usually to escape them with a backslash, but as noted above, the backslash itself needs to be escaped.

grepl("\\[", "a[b")
## [1] TRUE
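As a small sketch of why the doubled backslash works, the snippet below inspects the pattern string itself. The variable names are illustrative only; nchar and writeLines are base R functions.

```r
# The R string literal "\\[" holds two characters: a backslash and a bracket.
pattern <- "\\["

# Count the characters actually stored in the string.
n_chars <- nchar(pattern)

# writeLines shows the raw text that is handed to the regex engine: \[
writeLines(pattern)

# The regex engine reads \[ as "a literal opening bracket".
matched <- grepl(pattern, "a[b")
```

The R parser consumes one level of escaping when it reads the string literal, and the regular expression engine consumes the second.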

To match backslashes, you need to double escape, resulting in four backslashes.

grepl("\\\\", c("a\\b", "a\nb"))
## [1]  TRUE FALSE
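If the text to match is not known in advance, the escaping can be automated. The helper below is a minimal sketch, not part of base R or any package: escape_regex is a hypothetical name, it covers only the characters listed in the question, and it assumes perl = TRUE so that the backslashes inside the character class have well-defined PCRE semantics.

```r
# Hypothetical helper: put a backslash in front of each of \ ^ $ . ? * | + ( ) [ {
escape_regex <- function(x) {
  # Pattern text:     ([\\^$.?*|+()\[{])  one special character, captured
  # Replacement text: \\\1                a literal backslash, then the capture
  gsub("([\\\\^$.?*|+()\\[{])", "\\\\\\1", x, perl = TRUE)
}

escaped <- escape_regex("a[b")   # the string a\[b
hit <- grepl(escaped, "xa[by")   # the escaped text is now a usable pattern
sums <- escape_regex("1+1=2?")   # the string 1\+1=2\?
```

Printing escaped with writeLines shows the text the regex engine will see.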

The rebus package contains constants for each of the special characters to save you mistyping backslashes.

library(rebus)
OPEN_BRACKET
## [1] "\\["
BACKSLASH
## [1] "\\\\"

For more examples see:

?SpecialCharacters 

Your problem can be solved this way:

library(rebus)
grepl(OPEN_BRACKET, "a[b")

Form a character class

You can also wrap the special characters in square brackets to form a character class.

grepl("[?]", "a?b")
## [1] TRUE

Two of the special characters have special meaning inside character classes: \ and ^.

Backslash still needs to be escaped even if it is inside a character class.

grepl("[\\\\]", c("a\\b", "a\nb"))
## [1]  TRUE FALSE

Caret only needs to be escaped if it is directly after the opening square bracket.

grepl("[ ^]", "a^b") # matches spaces as well.
## [1] TRUE
grepl("[\\^]", "a^b")
## [1] TRUE
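The character class approach can be checked for the remaining special characters in one go. This is a rough sketch using base R only; backslash and caret are left out because they need the extra handling described above, and the variable names are illustrative.

```r
# Special characters that can be wrapped in square brackets without escaping.
specials <- c("$", ".", "?", "*", "|", "+", "(", ")", "[", "{")

# Build "[$]", "[.]", ... and test each against a string containing that character.
found <- vapply(
  specials,
  function(ch) grepl(paste0("[", ch, "]"), paste0("a", ch, "b")),
  logical(1)
)

# The same patterns should not match a string without the character.
false_hits <- vapply(
  specials,
  function(ch) grepl(paste0("[", ch, "]"), "ab"),
  logical(1)
)
```

Here all(found) should be TRUE and any(false_hits) should be FALSE.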

rebus also lets you form a character class.

char_class("?")
## <regex> [?]

Use a pre-existing character class

If you want to match all punctuation, you can use the [:punct:] character class.

grepl("[[:punct:]]", c("//", "[", "(", "{", "?", "^", "$"))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
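The same class also works for replacement. As a brief sketch with base R's gsub on made-up ASCII input, this strips every punctuation character from each string:

```r
# Illustrative input only.
x <- c("a[b]?c", "no punctuation", "$5.00 (approx.)")

# Replace each punctuation character with the empty string.
stripped <- gsub("[[:punct:]]", "", x)
```

For non-ASCII text, what counts as punctuation can depend on the locale and on the regex engine in use.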

stringi maps this to the Unicode General Category for punctuation, so its behaviour is slightly different.

library(stringi)
stri_detect_regex(c("//", "[", "(", "{", "?", "^", "$"), "[[:punct:]]")
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

You can also use the cross-platform syntax for accessing a Unicode General Category (UGC).

stri_detect_regex(c("//", "[", "(", "{", "?", "^", "$"), "\\p{P}")
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

Use \Q \E escapes

Placing characters between \\Q and \\E makes the regular expression engine treat them literally rather than as regular expressions.

grepl("\\Q.\\E", "a.b")
## [1] TRUE
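Literal quoting is handy when a fixed piece of text has to be combined with real regex syntax such as anchors. The sketch below assumes perl = TRUE (PCRE documents \Q...\E); quote_literal is a hypothetical helper name, and it does not handle input that itself contains \E.

```r
# Hypothetical helper: wrap text in \Q...\E so it is matched literally.
quote_literal <- function(x) paste0("\\Q", x, "\\E")

# Anchor the literal text so the whole string must equal "a.b(c)".
pattern <- paste0("^", quote_literal("a.b(c)"), "$")

hits <- grepl(pattern, c("a.b(c)", "aXb(c)", "a.bc"), perl = TRUE)
```

Only the first element should match, because the dot and the parentheses are taken literally while ^ and $ keep their regex meaning.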

rebus lets you write literal blocks of regular expressions.

literal(".")
## <regex> \Q.\E

Don't use regular expressions

Regular expressions are not always the answer. If you want to match a fixed string then you can do, for example:

grepl("[", "a[b", fixed = TRUE)
stringr::str_detect("a[b", stringr::fixed("["))
stringi::stri_detect_fixed("a[b", "[")
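The fixed = TRUE argument is accepted by other base R functions as well. A short sketch with made-up input, using only grepl, gsub and strsplit from base R:

```r
x <- "1+1=2 (probably)"

# Detect a literal plus sign without any escaping.
has_plus <- grepl("+", x, fixed = TRUE)

# Replace every literal opening parenthesis.
replaced <- gsub("(", "[", x, fixed = TRUE)

# Split on a literal dot rather than on "any character".
parts <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
```

With fixed = TRUE the pattern is never parsed as a regular expression, so none of the special characters need escaping.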
Answered Sep 18 '22 19:09 (9 revs, 4 users 93%)
