
Replace matches according to the pattern that was matched

Given a set of regular expressions, is there a simple way to match multiple patterns, and replace the matched text according to the pattern that was matched?

For example, for the following data x, each element begins with either a number or a letter, and ends with either a number or a letter. Let's call these patterns num_num (for begins with number, ends with number), num_let (begins with number, ends with letter), let_num, and let_let.

x <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv')
type <- list(
  num_let='^\\d.*[[:alpha:]]$',
  num_num='^\\d(.*\\d)?$',
  let_num='^[[:alpha:]].*\\d$',
  let_let='^[[:alpha:]](.*[[:alpha:]])$'
)

To replace each string with the name of the pattern it follows, we could do:

m <- lapply(type, grep, x)
rep(names(type), sapply(m, length))[order(unlist(m))]
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"

Is there a more efficient approach?


gsubfn?

I know that with gsubfn we can simultaneously replace different matches, e.g.:

library(gsubfn)
gsubfn('.*', list('1p33'='foo', '123abc'='bar'), x)
## [1] "bar"     "78fdsaq" "aq12111" "foo"     "123"     "pzv"

but I'm not sure whether the replacements can be made dependent on the pattern that was matched rather than on the match itself.


stringr?

str_replace_all doesn't play nicely with this example: the patterns are applied one after another to the already-replaced strings, so the replacement names are themselves matched by later patterns and everything ends up overwritten with let_let:

library(stringr)
str_replace_all(x, setNames(names(type), unlist(type)))
## [1] "let_let" "let_let" "let_let" "let_let" "let_let" "let_let"

Reordering type so the pattern corresponding to let_let appears first solves the problem, but needing to do this makes me nervous.

type2 <- rev(type)
str_replace_all(x, setNames(names(type2), unlist(type2)))
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"
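The overwriting can be reproduced without stringr. The sketch below is a base R emulation, assuming (as described above) that each pattern is applied in turn to the output of the previous step; the names `out` and `nm` are illustrative only.

```r
x <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv')
type <- list(
  num_let = '^\\d.*[[:alpha:]]$',
  num_num = '^\\d(.*\\d)?$',
  let_num = '^[[:alpha:]].*\\d$',
  let_let = '^[[:alpha:]](.*[[:alpha:]])$'
)

# Apply each pattern in turn to the result of the previous step,
# printing the intermediate state after every replacement.
out <- x
for (nm in names(type)) {
  out <- sub(type[[nm]], nm, out)
  cat(sprintf('%-8s', nm), out, '\n')
}
```

Every replacement name begins and ends with a letter, so by the time the let_let pattern is reached it matches all six strings.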
jbaums asked Jan 09 '16 01:01



2 Answers

Perhaps one of these.

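# base R method: start with empty strings, then write each pattern's
# name into the positions that the pattern matches
mm2 <- character(length(x))
for (n in seq_along(type)) {
  mm2 <- replace(mm2, grep(type[[n]], x), names(type)[n])
}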

# purrr 0.2.0 method
library(purrr)
mm3 <- map(grep, .x=type, x = x) %>% (function(z) replace(x, flatten_int(z), rep(names(type), lengths(z))))

The base R method is somewhat faster than the posted code for both small and larger data sets. The purrr method is slower than the posted code for small data sets but about the same as the base R method for larger data sets.
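Before comparing timings, it is worth confirming that the loop returns the same labels as the code in the question. A minimal base R check, with `res_question` and `res_loop` as illustrative names:

```r
x <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv')
type <- list(
  num_let = '^\\d.*[[:alpha:]]$',
  num_num = '^\\d(.*\\d)?$',
  let_num = '^[[:alpha:]].*\\d$',
  let_let = '^[[:alpha:]](.*[[:alpha:]])$'
)

# Approach from the question: collect match positions per pattern,
# then reorder the repeated names back into the order of x.
m <- lapply(type, grep, x)
res_question <- rep(names(type), sapply(m, length))[order(unlist(m))]

# Loop approach: overwrite positions pattern by pattern.
res_loop <- character(length(x))
for (n in seq_along(type)) {
  res_loop <- replace(res_loop, grep(type[[n]], x), names(type)[n])
}

identical(res_question, res_loop)
```

The two agree only when every string matches exactly one pattern. If a string matches several patterns, the loop keeps the last one, whereas the question's code returns more than length(x) elements; if a string matches none, the loop leaves an empty string.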

WaltS answered Oct 18 '22 22:10


stringr

We can use str_replace_all if we first wrap the replacements in a marker, so that none of the regular expressions can match them, and then add one final replacement that strips the marker again. For example:

library(stringr)
type2 <- setNames(c(str_replace(names(type), "(.*)", "__\\1__"), "\\1"), 
                  c(unlist(type), "^__(.*)__$"))
str_replace_all(x, type2)
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"
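Why the markers help can be checked in base R. The sketch below (the names `plain_hits`, `wrapped` and `wrapped_hits` are illustrative) tests the bare pattern names and the underscore-wrapped names against every pattern:

```r
type <- list(
  num_let = '^\\d.*[[:alpha:]]$',
  num_num = '^\\d(.*\\d)?$',
  let_num = '^[[:alpha:]].*\\d$',
  let_let = '^[[:alpha:]](.*[[:alpha:]])$'
)

# Rows = candidate strings, columns = patterns.
plain_hits <- sapply(type, grepl, names(type))

# Wrapped names start and end with "_", which is neither a digit nor a letter.
wrapped <- paste0('__', names(type), '__')
wrapped_hits <- sapply(type, grepl, wrapped)

plain_hits[, 'let_let']   # bare names are all matched by let_let
any(wrapped_hits)         # wrapped names are matched by nothing

# The final pattern strips the markers again.
sub('^__(.*)__$', '\\1', wrapped)
```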

grepl and tidyr

Another approach is to match first and then replace. One way to do this is with grepl and tidyr:

library(plyr)
library(dplyr)
library(tidyr)

out <- data.frame(t(1*aaply(type, 1, grepl, x)))

out[out == 0] <- NA
out <- out %>% 
  mutate(id = 1:nrow(.)) %>%
  gather(name, value, -id, na.rm = TRUE) %>%
  arrange(id) %>%   # gather() stacks the rows by pattern, so restore the order of x
  select(name)
as.character(out[,1])
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"

While this approach doesn't look as efficient, it makes it easy to find elements that match more or fewer than one pattern.
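The same match-then-label idea can be sketched in base R alone. This is not the code above, just a compact variant; `x2` adds a hypothetical string, '_odd', that matches no pattern, to show how such cases surface.

```r
type <- list(
  num_let = '^\\d.*[[:alpha:]]$',
  num_num = '^\\d(.*\\d)?$',
  let_num = '^[[:alpha:]].*\\d$',
  let_let = '^[[:alpha:]](.*[[:alpha:]])$'
)
# '_odd' is a made-up extra element that no pattern matches.
x2 <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv', '_odd')

# Logical matrix: one row per string, one column per pattern.
hits <- sapply(type, grepl, x2)
n_match <- rowSums(hits)

# Positions of strings that match zero or several patterns.
bad <- which(n_match != 1)

# Label strings with exactly one match; everything else becomes NA.
first_hit <- max.col(hits + 0, ties.method = 'first')
label <- ifelse(n_match == 1, names(type)[first_hit], NA_character_)
```

Here `bad` flags the seventh element, and `label` is NA for it rather than silently picking a pattern.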


From what I understand, substitution matching is implemented in pcre2, and I believe it would allow this type of problem to be solved directly in the regex. Unfortunately, it seems that no one has built a pcre2 package for R yet.

NGaffney answered Oct 18 '22 22:10