I am looking for an efficient way to extract all matches between two substrings in a character string. For example, say I want to extract all substrings contained between the string
start="strt"
and
stop="stp"
in string
x="strt111stpblablastrt222stp"
I would like to get the vector
"111" "222"
What is the most efficient way to do this in R? Using a regular expression perhaps? Or are there better ways?
For something simple like this, base R works just fine.
You can switch on PCRE with perl=TRUE and use lookaround assertions.
x <- 'strt111stpblablastrt222stp'
regmatches(x, gregexpr('(?<=strt).*?(?=stp)', x, perl=TRUE))[[1]]
# [1] "111" "222"
Explanation:
(?<=      # look behind to see if there is:
  strt    #   'strt'
)         # end of look-behind
.*?       # any character except \n (0 or more times, as few as possible)
(?=       # look ahead to see if there is:
  stp     #   'stp'
)         # end of look-ahead
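The same pattern can be built for arbitrary delimiters. Below is a minimal sketch, assuming the delimiters should be treated as literal text; `extract_between` is a hypothetical helper name, not something from base R or from the answer above. It wraps each delimiter in PCRE's `\Q...\E` quoting so that characters such as `[` are not interpreted as regex syntax.

```r
# Hypothetical helper (an illustration, not part of the original answer):
# return every substring lying between the literal delimiters `start` and `stop`.
extract_between <- function(x, start, stop) {
  # \Q...\E quotes the delimiters literally (assumes neither contains "\E")
  pattern <- paste0("(?<=\\Q", start, "\\E).*?(?=\\Q", stop, "\\E)")
  # one list element per input string, as with the plain regmatches() call
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

extract_between("strt111stpblablastrt222stp", "strt", "stp")[[1]]
# [1] "111" "222"
extract_between("a[1]b[2]", "[", "]")[[1]]
# [1] "1" "2"
```

Because `regmatches` is vectorised, passing a character vector as `x` should return one element per input string.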
EDIT: Updated the answers below to use the new syntax.
You may also consider using the stringi package.
library(stringi)
x <- 'strt111stpblablastrt222stp'
stri_extract_all_regex(x, '(?<=strt).*?(?=stp)')[[1]]
# [1] "111" "222"
And rm_between from the qdapRegex package.
library(qdapRegex)
x <- 'strt111stpblablastrt222stp'
rm_between(x, 'strt', 'stp', extract=TRUE)[[1]]
# [1] "111" "222"
If speed is what matters when working with strings in R, there is really only one package to reach for: stringi.
library(stringi)
library(stringr)
library(microbenchmark)
x <- "strt111stpblablastrt222stp"
# base R: lazy match between a look-behind and a look-ahead
hwnd <- function(x1) regmatches(x1, gregexpr('(?<=strt).*?(?=stp)', x1, perl=TRUE))
# base R: consume characters for as long as 'stp' does not start here
Tim <- function(x1) regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl=TRUE))
stringr <- function(x1) str_extract_all(x1, '(?<=strt).*?(?=stp)')
# genXtract() comes from the qdap package (defined here, but not benchmarked)
akrun <- function(x1) genXtract(x1, "strt", "stp")
stringi <- function(x1) stri_extract_all_regex(x1, '(?<=strt).*?(?=stp)')
microbenchmark(stringi(x), hwnd(x), Tim(x), stringr(x))
Unit: microseconds
       expr     min       lq  median       uq     max neval
 stringi(x)  46.778  58.1030  64.017  67.3485 123.398   100
    hwnd(x)  61.498  73.1095  79.084  85.5190 111.757   100
     Tim(x)  60.243  74.6830  80.755  86.3370 102.678   100
 stringr(x) 236.081 261.9425 272.115 279.6750 440.036   100
Unfortunately, I couldn't test @akrun's solution because the qdap package threw errors during installation, and his is the only solution that looks like it could beat stringi...
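Timings aside, it may be worth confirming that the candidates return the same thing. The sketch below uses only base R and compares the two base-R patterns from the benchmark; the function names `lazy` and `tempered` are made up for this illustration. On the example string they agree, but they differ when the closing delimiter is missing: the lazy pattern requires 'stp' to follow, while the other simply stops at the end of the string.

```r
# Hypothetical names for the two base-R patterns benchmarked above
lazy     <- function(x1) regmatches(x1, gregexpr("(?<=strt).*?(?=stp)", x1, perl = TRUE))[[1]]
tempered <- function(x1) regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl = TRUE))[[1]]

x <- "strt111stpblablastrt222stp"
identical(lazy(x), tempered(x))
# [1] TRUE

# Same string with the final 'stp' removed
y <- "strt111stpblablastrt222"
lazy(y)
# [1] "111"
tempered(y)
# [1] "111" "222"
```

Which behaviour is preferable depends on whether an unterminated block should count as a match.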