R: fastest way to extract all substrings contained between two substrings

Question

I am looking for an efficient way to extract all substrings that lie between two delimiter substrings in a character string. For example, say I want to extract all substrings contained between the string

start="strt"

and

stop="stp"

in the string

x="strt111stpblablastrt222stp"

I would like to get the vector

"111" "222"

What is the most efficient way to do this in R? Using a regular expression perhaps? Or are there better ways?

hwnd · Accepted Answer

For something simple like this, base R handles this just fine.

You can enable PCRE with perl=TRUE and use lookaround assertions.

x <- 'strt111stpblablastrt222stp'
regmatches(x, gregexpr('(?<=strt).*?(?=stp)', x, perl=TRUE))[[1]]
# [1] "111" "222"

Explanation:

(?<=          # look behind to see if there is:
  strt        #   'strt'
)             # end of look-behind
.*?           # any character except \n (0 or more times)
(?=           # look ahead to see if there is:
  stp         #   'stp'
)             # end of look-ahead
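One consequence of `.` not matching a newline is that a chunk spanning several lines is skipped. The following is a minimal sketch, assuming the delimited text may contain line breaks; the names `x_multi`, `default_hits` and `dotall_hits` are illustrative and not part of the original answer. It uses the PCRE inline flag `(?s)` so that `.` also matches newlines.

```r
# Hypothetical input: the first delimited chunk contains a line break
x_multi <- "strt1\n11stpblablastrt222stp"

# Default behaviour: '.' stops at the newline, so the first chunk cannot match
default_hits <- regmatches(
  x_multi,
  gregexpr("(?<=strt).*?(?=stp)", x_multi, perl = TRUE)
)[[1]]

# With the (?s) flag, '.' matches newlines as well
dotall_hits <- regmatches(
  x_multi,
  gregexpr("(?s)(?<=strt).*?(?=stp)", x_multi, perl = TRUE)
)[[1]]

print(default_hits)
print(dotall_hits)
```

Whether to enable `(?s)` depends on the data: if the delimiters are always on the same line, the default is the safer choice.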

EDIT: The answers below have been updated to the new syntax.

You may also consider using the stringi package.

library(stringi)
x <- 'strt111stpblablastrt222stp'
stri_extract_all_regex(x, '(?<=strt).*?(?=stp)')[[1]]
# [1] "111" "222"

And rm_between from the qdapRegex package.

library(qdapRegex)
x <- 'strt111stpblablastrt222stp'
rm_between(x, 'strt', 'stp', extract=TRUE)[[1]]
# [1] "111" "222"
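Since the question also asks about options other than regular expressions, here is a rough base R sketch that uses only fixed-string matching. The helper name `between_fixed` is hypothetical, it handles a single input string, and it is not claimed to be faster than the approaches above. It also behaves differently from the lookaround pattern when a start delimiter appears again before the next stop delimiter.

```r
# Hypothetical helper: extract text between literal delimiters without regex
between_fixed <- function(x, start, stop) {
  # Split on the start delimiter; drop whatever precedes the first occurrence
  pieces <- strsplit(x[[1]], start, fixed = TRUE)[[1]][-1]
  # Keep only the chunks that actually contain a stop delimiter
  pieces <- pieces[grepl(stop, pieces, fixed = TRUE)]
  if (length(pieces) == 0) return(character(0))
  # Cut each chunk just before its first stop delimiter
  stop_pos <- as.integer(regexpr(stop, pieces, fixed = TRUE))
  substr(pieces, 1, stop_pos - 1)
}

x <- "strt111stpblablastrt222stp"
print(between_fixed(x, "strt", "stp"))
```

Because the delimiters are treated as literal text, they need no escaping even if they contain regex metacharacters.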

bartektartanus · Answer

If speed is what matters for string processing in R, there is only one package for the job: stringi.

 library(stringi)
 library(stringr)

 x <- "strt111stpblablastrt222stp"
 hwnd <- function(x1) regmatches(x1, gregexpr('(?<=strt).*?(?=stp)', x1, perl=TRUE))
 Tim <- function(x1) regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl=TRUE))
 # perl() is deprecated in current stringr; a plain pattern supports lookarounds
 stringr <- function(x1) str_extract_all(x1, '(?<=strt).*?(?=stp)')
 # genXtract comes from the qdap package (not benchmarked below)
 akrun <- function(x1) genXtract(x1, "strt", "stp")
 # stringi uses ICU regex directly, so no perl() wrapper is needed
 stringi <- function(x1) stri_extract_all_regex(x1, '(?<=strt).*?(?=stp)')

 library(microbenchmark)
 microbenchmark(stringi(x), hwnd(x), Tim(x), stringr(x))
Unit: microseconds
       expr     min       lq  median       uq     max neval
 stringi(x)  46.778  58.1030  64.017  67.3485 123.398   100
    hwnd(x)  61.498  73.1095  79.084  85.5190 111.757   100
     Tim(x)  60.243  74.6830  80.755  86.3370 102.678   100
 stringr(x) 236.081 261.9425 272.115 279.6750 440.036   100

Unfortunately, I couldn't test @akrun's solution because the qdap package failed to install. It is the only solution that looks like it could beat stringi...
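Before reading too much into timings, it can help to confirm that the candidates return the same result, including on a longer input than the 26-character test string. This is a minimal sketch using only the two base R variants from the benchmark; `x_long`, `res_hwnd` and `res_tim` are illustrative names, and the repetition count of 1000 is an arbitrary assumption.

```r
# The two base R candidates from the benchmark above
hwnd <- function(x1) regmatches(x1, gregexpr("(?<=strt).*?(?=stp)", x1, perl = TRUE))
Tim  <- function(x1) regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl = TRUE))

# Hypothetical larger input: the original test string repeated 1000 times
x <- "strt111stpblablastrt222stp"
x_long <- paste(rep(x, 1000), collapse = "")

res_hwnd <- hwnd(x_long)
res_tim  <- Tim(x_long)

# Both patterns should agree before their timings are compared
print(identical(res_hwnd, res_tim))
print(length(res_hwnd[[1]]))
```

Passing `x_long` to `microbenchmark()` in place of `x` would show whether the ranking holds on larger inputs; no timings are claimed here.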

Tags: string, substring, regex, r

Tom Wenseleers
