
Applying a function to a backreference within gsub in R

I'm new to R and am stuck with backreferencing that doesn't seem to work. In:

gsub("\\((\\d+)\\)", f("\\1"), string)

It correctly grabs the number between the parentheses but doesn't apply the function f (correctly defined and working otherwise) to that number: it's actually the literal string "\1" that gets passed to f.

Am I missing something or is it just that R does not handle this? If so, any idea how I could do something similar, i.e. applying a function "on the fly" to the (actually many) numbers that occur in between parentheses in the text I'm parsing?

Thanks a lot for your help.

JMD asked Aug 26 '14 13:08




2 Answers

R does not have the option of applying a function directly to a match via gsub. You have to extract the match, transform the value, and then put the new value back. This is relatively easy with the regmatches function. For example:

x<-"(990283)M (31)O (29)M (6360)M"

f<-function(x) {
    v<-as.numeric(substr(x,2,nchar(x)-1))
    paste0(v+5,".1")
}

m <- gregexpr("\\(\\d+\\)", x)
regmatches(x, m) <- lapply(regmatches(x, m), f)
x
# [1] "990288.1M 36.1O 34.1M 6365.1M"

You can make f do whatever you like; just make sure it's vector-friendly, because it receives all the matches from one input string as a single character vector. You could also wrap this in your own function:

gsubf <- function(pattern, x, f) {
    m <- gregexpr(pattern, x)
    regmatches(x, m) <- lapply(regmatches(x, m), f)
    x   
}
gsubf("\\(\\d+\\)", x, f)
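
As a minimal sketch of what "vector-friendly" means here: the callback is handed every match from one input string at once, so a callback written for a single value needs wrapping first. The names vectorize_cb, scalar_f and txt are hypothetical and not part of the original answer.

```r
# Hypothetical helper: turn a one-match-at-a-time callback into a vectorized one.
vectorize_cb <- function(f) {
  function(matches) vapply(matches, f, character(1), USE.NAMES = FALSE)
}

# Scalar callback: `if` handles only a single value, so it is not vector-friendly.
scalar_f <- function(s) {
  v <- as.numeric(substr(s, 2, nchar(s) - 1))  # strip the parentheses
  if (v > 100) "(big)" else "(small)"
}

txt <- "(990283)M (31)O (29)M (6360)M"
m <- gregexpr("\\(\\d+\\)", txt)
regmatches(txt, m) <- lapply(regmatches(txt, m), vectorize_cb(scalar_f))
txt
# [1] "(big)M (small)O (small)M (big)M"
```

Passing scalar_f directly would fail, since `if` would receive a length-4 condition; the wrapper sidesteps that at the cost of one function call per match.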

Note that in these examples we're not using a capture group, we're just grabbing the entire match. There are ways to extract the capture groups but they are a bit messier. If you wanted to provide an example where such an extraction is required, I might be able to come up with something fancier.
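
One possible way to reach a capture group in base R is to re-apply the pattern to each isolated hit with sub() and a backreference. This is only a sketch: it assumes the pattern still matches a hit once it is taken out of context, which holds for the simple pattern here but not necessarily for patterns with lookarounds or anchors. The helper name gsubf_group and the callback signature are hypothetical.

```r
# Hypothetical helper: call f(whole_match, first_capture_group) for every hit.
gsubf_group <- function(pattern, x, f) {
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    # Re-run the pattern on each hit to pull out capture group 1
    groups <- sub(pattern, "\\1", hits, perl = TRUE)
    f(hits, groups)
  })
  x
}

txt <- "(990283)M (31)O (29)M (6360)M"
res <- gsubf_group(
  "\\((\\d+)\\)", txt,
  function(whole, num) paste0("[", as.integer(num) + 1L, "]")
)
res
# [1] "[990284]M [32]O [30]M [6361]M"
```

The callback gets both the whole match and the group, so it can decide whether to keep the surrounding parentheses or drop them.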

MrFlick answered Sep 26 '22 19:09


To use a callback within a regex-capable replacement function, you may use either gsubfn or stringr functions.

When choosing between them, note that stringr is based on the ICU regex engine, whereas gsubfn uses either the Tcl engine by default (if the R installation has tcltk capability; otherwise the default is TRE) or PCRE (if you pass the perl=TRUE argument).

Also note that gsubfn gives access to all capturing groups in the match, while str_replace_all only lets you manipulate the whole match. Thus, for str_replace_all, the regex should look like (?<=\()\d+(?=\)), which matches one or more digits only when they are enclosed in ( and ), while keeping the parentheses themselves out of the match.
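
To see what that lookaround pattern actually matches, here is a small base R check. It assumes perl = TRUE (PCRE), since lookarounds are not available in the default engine; the variable names are illustrative only.

```r
s <- "(990283)M (31)O (29)M (6360)M"

# Lookbehind/lookahead: the parentheses must be present but are not consumed
digits_only <- regmatches(s, gregexpr("(?<=\\()\\d+(?=\\))", s, perl = TRUE))[[1]]

# Plain pattern: the parentheses are part of each match
whole <- regmatches(s, gregexpr("\\(\\d+\\)", s))[[1]]

# digits_only is c("990283", "31", "29", "6360")
# whole       is c("(990283)", "(31)", "(29)", "(6360)")
```

Because the lookaround version never consumes the parentheses, a whole-match callback can replace the digits and leave the brackets in place.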

With stringr, you may use str_replace_all:

library(stringr)  
string <- "(990283)M (31)O (29)M (6360)M"
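## Callback function to increment the found number; it returns a character
## vector because the result is spliced back into the string:
f <- function(x) { as.character(as.integer(x) + 1L) }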
str_replace_all(string, "(?<=\\()\\d+(?=\\))", function(m) f(m))
## => [1] "(990284)M (32)O (30)M (6361)M"

With gsubfn, pass perl=TRUE and backref=0 to be able to use lookarounds and just modify the whole match:

library(gsubfn)
gsubfn("(?<=\\()\\d+(?=\\))", ~ f(m), string, perl=TRUE, backref=0)
## => [1] "(990284)M (32)O (30)M (6361)M"

If you have multiple groups in the pattern, remove backref=0 and list one argument per group in the callback function declaration:

gsubfn("(\\()(\\d+)(\\))", function(m,n,o) paste0(m,f(n),o), string, perl=TRUE)
#       ^ 1 ^^  2 ^^ 3 ^           ^^^^^^^          ^^^^

Wiktor Stribiżew answered Sep 24 '22 19:09