Is there an R function to escape a string for regex characters

Tags: string, regex, r

I want to build a regular expression by substituting in some strings to search for, so these strings need to be escaped before I put them in the regex; that way it still works if a searched-for string contains regex metacharacters.

Some languages have functions that will do this for you (e.g. python re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?

For example (made up function):

x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"

Corvus asked Feb 12 '13 16:02


3 Answers

I've written an R version of Perl's quotemeta function:

library(stringr)
quotemeta <- function(string) {
  str_replace_all(string, "(\\W)", "\\\\\\1")
}

I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.

Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:

This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:

$pattern =~ s/(\W)/\\$1/g;

As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):

Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.

which reinforces my point that this solution is only guaranteed for PCRE.
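
For readers without stringr, here is a minimal base-R sketch of the same substitution, assuming perl = TRUE is used both when escaping and when matching. The name quotemeta_base is hypothetical and not part of any package.

```r
# Base-R sketch of the same substitution; quotemeta_base is a hypothetical name.
quotemeta_base <- function(string) {
  # Prefix every non-word character with a backslash.
  gsub("(\\W)", "\\\\\\1", string, perl = TRUE)
}

x <- "foo[bar]"
escaped <- quotemeta_base(x)
cat(escaped, "\n")  # shows the pattern without R's string quoting

# The escaped pattern matches only the literal text.
hit  <- grepl(escaped, "see foo[bar] here", perl = TRUE)  # TRUE
miss <- grepl(escaped, "see foob here", perl = TRUE)      # FALSE
print(c(hit, miss))
```

Without escaping, "foo[bar]" would be read as "foo" followed by one character from the class [bar], so it would match "foob" and miss the literal text.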

Ryan C. Thompson answered Oct 07 '22 14:10


Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':

gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
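
As a rough illustration, the substitution quoted above can be wrapped in a local function so that it runs without installing Hmisc. The name escape_regex_local and the sample strings are assumptions made for this sketch.

```r
# Local copy of the substitution quoted above, so the example runs without Hmisc.
# escape_regex_local is a hypothetical name, not part of any package.
escape_regex_local <- function(string) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
}

needle <- "1+1=2 (approx.)"
# Anchor the escaped text so the whole string must match literally.
pattern <- paste0("^", escape_regex_local(needle), "$")
cat(pattern, "\n")

same  <- grepl(pattern, "1+1=2 (approx.)")  # TRUE
other <- grepl(pattern, "11=2 approxx")     # FALSE
print(c(same, other))
```

The second string is one that the unescaped pattern would accept, since "+", "." and the parentheses would otherwise be treated as regex syntax.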

My previous answer:

I'm not sure if there is a built-in function, but you could write one to do what you want. The function below creates a vector of the characters to replace and a vector of their escaped replacements, then loops through them making each replacement in turn.

re.escape <- function(strings){
    vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)", 
              "\\{", "\\}", "\\^", "\\$","\\*", 
              "\\+", "\\?", "\\.", "\\|")
    replace.vals <- paste0("\\\\", vals)
    for(i in seq_along(vals)){
        strings <- gsub(vals[i], replace.vals[i], strings)
    }
    strings
}

Some output

> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"  
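
Escaped strings like these can then be pasted into a larger pattern, which is the use case in the question. The sketch below uses a hypothetical one-pass helper, escape_chars, as a stand-in for the loop above (it assumes perl = TRUE) and joins the escaped strings into an alternation.

```r
# escape_chars is a hypothetical one-pass stand-in for the loop-based function above.
escape_chars <- function(strings) {
  # Escape backslash, brackets, parentheses, braces and the other metacharacters.
  gsub("([\\\\\\[\\](){}^$*+?.|])", "\\\\\\1", strings, perl = TRUE)
}

test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
escaped <- escape_chars(test.strings)

# Join the escaped strings so that either one matches literally.
pattern <- paste(escaped, collapse = "|")
found <- grepl(pattern,
               c("x foo[bar] y", "foobar", "What the $^&(){}.*|?"),
               perl = TRUE)
print(found)  # TRUE FALSE TRUE
```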

Dason answered Oct 07 '22 12:10


An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
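
A minimal sketch of this approach, assuming the pattern is matched with perl = TRUE (PCRE treats everything between \Q and \E as literal text). Note that a search string which itself contains \E would end the quoting early, so this sketch assumes the input does not.

```r
x <- "foo[bar]"

# Wrap the literal text in \Q ... \E instead of escaping each character.
pattern <- paste0("\\Q", x, "\\E")
cat(pattern, "\n")

hit  <- grepl(pattern, "see foo[bar] here", perl = TRUE)  # TRUE
miss <- grepl(pattern, "see foob here", perl = TRUE)      # FALSE
print(c(hit, miss))
```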

Paul Lemmens answered Oct 07 '22 14:10