Given a set of regular expressions, is there a simple way to match multiple patterns, and replace the matched text according to the pattern that was matched?
For example, for the following data x, each element begins with either a number or a letter, and ends with either a number or a letter. Let's call these patterns num_num (for begins with number, ends with number), num_let (begins with number, ends with letter), let_num, and let_let.
x <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv')
type <- list(
  num_let = '^\\d.*[[:alpha:]]$',
  num_num = '^\\d(.*\\d)?$',
  let_num = '^[[:alpha:]].*\\d$',
  let_let = '^[[:alpha:]](.*[[:alpha:]])$'
)
To replace each string with the name of the pattern it follows, we could do:
m <- lapply(type, grep, x)
rep(names(type), sapply(m, length))[order(unlist(m))]
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"
Is there a more efficient approach?
gsubfn?
I know that with gsubfn we can simultaneously replace different matches, e.g.:
library(gsubfn)
gsubfn('.*', list('1p33'='foo', '123abc'='bar'), x)
## [1] "bar" "78fdsaq" "aq12111" "foo" "123" "pzv"
but I'm not sure whether the replacements can be made dependent on the pattern that was matched rather than on the match itself.
stringr?
str_replace_all doesn't play nicely with this example, since the patterns are applied one after another, and we end up with everything being overwritten with let_let:
library(stringr)
str_replace_all(x, setNames(names(type), unlist(type)))
## [1] "let_let" "let_let" "let_let" "let_let" "let_let" "let_let"
Reordering type so the pattern corresponding to let_let appears first solves the problem, but needing to do this makes me nervous.
type2 <- rev(type)
str_replace_all(x, setNames(names(type2), unlist(type2)))
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"
To perform multiple replacements in each element of string, pass a named vector (c(pattern1 = replacement1)) to str_replace_all. Alternatively, pass a function to replacement: it will be called once for each match and its return value will be used to replace the match.
Perhaps one of these.
# base R method: start with empty strings, then write each pattern's name at the positions it matches
mm2 <- character(length(x))
for (n in seq_along(type)) mm2 <- replace(mm2, grep(type[[n]], x), names(type)[n])
# purrr 0.2.0 method
library(purrr)
mm3 <- map(type, grep, x = x) %>%
  (function(z) replace(x, flatten_int(z), rep(names(type), lengths(z))))
The base R method is somewhat faster than the posted code for both small and larger data sets. The purrr method is slower than the posted code for small data sets but about the same as the base R method for larger data sets.
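A loop-free variation on the base R idea, offered as a rough sketch rather than a benchmarked alternative: collect one logical column per pattern with grepl, then take the first matching pattern in each row. The names hits, first and result are made up for this illustration, and x and type are copied from the question. Unlike the loop, where a later pattern overwrites an earlier one, the first pattern listed in type wins here, and elements that match nothing come back as NA.

```r
x <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv')
type <- list(
  num_let = '^\\d.*[[:alpha:]]$',
  num_num = '^\\d(.*\\d)?$',
  let_num = '^[[:alpha:]].*\\d$',
  let_let = '^[[:alpha:]](.*[[:alpha:]])$'
)

# One row per element of x, one column per pattern (hypothetical name: hits).
hits <- do.call(cbind, lapply(type, grepl, x))

# Index of the first matching pattern in each row, or NA if none matches.
first <- apply(hits, 1, function(r) which(r)[1])

# Translate the column index into the pattern name.
result <- names(type)[first]
result
```

Because the choice is made row by row from the full match matrix, the outcome depends only on the order of type when an element matches more than one pattern.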
We can use str_replace_all if we alter the replacements so they are no longer matched by any of the regular expressions and then add an additional replacement to return them to their original form. For example:
library(stringr)
type2 <- setNames(c(str_replace(names(type), "(.*)", "__\\1__"), "\\1"),
c(unlist(type), "^__(.*)__$"))
str_replace_all(x, type2)
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"
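To see why the placeholder trick works, the same sequence of replacements can be unrolled with base R's sub(). This is only an illustrative sketch: the variable out and the loop are not part of the answer above, and x and type are copied from the question. Once a string has been rewritten to something like __num_let__, it starts with an underscore, so none of the later patterns (which all require a leading digit or letter) can match it again.

```r
x <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv')
type <- list(
  num_let = '^\\d.*[[:alpha:]]$',
  num_num = '^\\d(.*\\d)?$',
  let_num = '^[[:alpha:]].*\\d$',
  let_let = '^[[:alpha:]](.*[[:alpha:]])$'
)

out <- x  # hypothetical working copy
for (nm in names(type)) {
  # Replace the whole string with a placeholder that no pattern in type can match.
  out <- sub(type[[nm]], paste0("__", nm, "__"), out)
}

# Final pass: strip the protective underscores again.
out <- sub("^__(.*)__$", "\\1", out)
out
```

The sketch relies on every pattern being anchored at both ends, so each successful match replaces the entire string.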
Another approach is to match first and then replace. One way to do this is to use grepl and tidyr:
library(plyr)
library(dplyr)
library(tidyr)
out <- data.frame(t(1*aaply(type, 1, grepl, x)))
out[out == 0] <- NA
out <- out %>%
  mutate(id = 1:nrow(.)) %>%
  gather(name, value, -id, na.rm = TRUE) %>%
  arrange(id) %>%  # gather() stacks rows pattern by pattern, so restore the order of x
  select(name)
as.character(out[,1])
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"
While this approach doesn't look as efficient, it makes it easy to find rows that match more or fewer than one pattern.
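The same kind of check can be sketched in base R by counting matches per element. This is an assumption-laden illustration, not the method used above: the extra element '#1' is made up so that something fails to match, and the names hits, n_matches, unmatched and ambiguous are hypothetical.

```r
# '#1' is an invented extra element that none of the patterns should match.
x <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv', '#1')
type <- list(
  num_let = '^\\d.*[[:alpha:]]$',
  num_num = '^\\d(.*\\d)?$',
  let_num = '^[[:alpha:]].*\\d$',
  let_let = '^[[:alpha:]](.*[[:alpha:]])$'
)

# Logical matrix: rows are elements of x, columns are patterns.
hits <- do.call(cbind, lapply(type, grepl, x))

# Number of patterns each element matches.
n_matches <- rowSums(hits)

unmatched <- x[n_matches == 0]  # elements matching no pattern
ambiguous <- x[n_matches > 1]   # elements matching several patterns
unmatched
ambiguous
```

With the patterns from the question, each of which fixes both the first and last character class, no element can match more than one pattern, so ambiguous is expected to be empty here.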
From what I understand, substitution matching is implemented in PCRE2, and I believe it allows this type of problem to be solved directly in the regex. Unfortunately, it seems that no one has built a PCRE2 package for R yet.