
R: fastest way to extract all substrings contained between two substrings

I am looking for an efficient way to extract all matches between two substrings in a character string. For example, say I want to extract all substrings contained between the string

start="strt"

and

stop="stp"
in the string
x="strt111stpblablastrt222stp"

I would like to get the vector

"111" "222"

What is the most efficient way to do this in R? Using a regular expression perhaps? Or are there better ways?

Tom Wenseleers asked Jul 16 '14 06:07


2 Answers

For something simple like this, base R handles this just fine.

You can switch on PCRE with perl=TRUE and use lookaround assertions.

x <- 'strt111stpblablastrt222stp'
regmatches(x, gregexpr('(?<=strt).*?(?=stp)', x, perl=TRUE))[[1]]
# [1] "111" "222"

Explanation:

(?<=          # look behind to see if there is:
  strt        #   'strt'
)             # end of look-behind
.*?           # any character except \n (0 or more times, as few as possible)
(?=           # look ahead to see if there is:
  stp         #   'stp'
)             # end of look-ahead
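
The `?` after `.*` is what keeps each match short. As a small illustration (a sketch using only base R, with the variable names `lazy` and `greedy` chosen here for the example), compare the lazy pattern from the answer with a greedy variant on the same input:

```r
x <- 'strt111stpblablastrt222stp'

# Lazy quantifier (.*?): stops at the first 'stp' after each 'strt'.
lazy <- regmatches(x, gregexpr('(?<=strt).*?(?=stp)', x, perl = TRUE))[[1]]

# Greedy quantifier (.*): runs on to the last 'stp' in the string.
greedy <- regmatches(x, gregexpr('(?<=strt).*(?=stp)', x, perl = TRUE))[[1]]

print(lazy)
# [1] "111" "222"
print(greedy)
# [1] "111stpblablastrt222"
```

With the greedy form, everything between the first `strt` and the last `stp` comes back as a single match, which is usually not what is wanted here.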

EDIT: Updated the answers below to use the new syntax.

You may also consider using the stringi package.

library(stringi)
x <- 'strt111stpblablastrt222stp'
stri_extract_all_regex(x, '(?<=strt).*?(?=stp)')[[1]]
# [1] "111" "222"

And rm_between from the qdapRegex package.

library(qdapRegex)
x <- 'strt111stpblablastrt222stp'
rm_between(x, 'strt', 'stp', extract=TRUE)[[1]]
# [1] "111" "222"
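
If the delimiters may contain regex metacharacters, a regex-free approach is also possible. The following is a minimal sketch in base R, not taken from any of the answers; `extract_between_fixed` is a hypothetical helper name introduced here, and it treats both markers as literal strings:

```r
# Hypothetical helper (not from the original answers): literal-string extraction.
extract_between_fixed <- function(x, start, stop) {
  # Split on the start marker; the first piece precedes any start marker, so drop it.
  pieces <- strsplit(x, start, fixed = TRUE)[[1]][-1]
  # Keep only the pieces that actually contain a stop marker.
  pieces <- pieces[grepl(stop, pieces, fixed = TRUE)]
  # Cut each piece just before its first stop marker.
  stop_pos <- regexpr(stop, pieces, fixed = TRUE)
  substr(pieces, 1, stop_pos - 1)
}

x <- 'strt111stpblablastrt222stp'
print(extract_between_fixed(x, 'strt', 'stp'))
# [1] "111" "222"
print(extract_between_fixed('a[1]b[22]', '[', ']'))
# [1] "1"  "22"
```

Note that this can differ from the regex versions when start markers repeat before a stop marker: the sketch keeps only the text after the last start marker, while the lazy regex begins at the first one.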

hwnd answered Oct 11 '22 09:10


If speed is what matters for string processing in R, the package to reach for is stringi.

 library(stringi)
 library(stringr)
 x <- "strt111stpblablastrt222stp"
 hwnd <- function(x1) regmatches(x1, gregexpr('(?<=strt).*?(?=stp)', x1, perl=TRUE))
 Tim <- function(x1) regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl=TRUE))
 # current stringr and stringi both use ICU regex, so lookarounds need no perl() wrapper
 stringr <- function(x1) str_extract_all(x1, '(?<=strt).*?(?=stp)')
 # needs library(qdap); defined for completeness but not benchmarked below
 akrun <- function(x1) genXtract(x1, "strt", "stp")
 stringi <- function(x1) stri_extract_all_regex(x1, '(?<=strt).*?(?=stp)')

 require(microbenchmark)
 microbenchmark(stringi(x), hwnd(x), Tim(x), stringr(x))
Unit: microseconds
       expr     min       lq  median       uq     max neval
 stringi(x)  46.778  58.1030  64.017  67.3485 123.398   100
    hwnd(x)  61.498  73.1095  79.084  85.5190 111.757   100
     Tim(x)  60.243  74.6830  80.755  86.3370 102.678   100
 stringr(x) 236.081 261.9425 272.115 279.6750 440.036   100

Unfortunately, I couldn't test @akrun's solution because the qdap package fails with errors during installation. His is the only solution that looks like it could beat stringi...
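
The timings above are for a single 26-character string, so it may be worth repeating the comparison on a longer input. Here is a rough sketch using only base R; the input `big` (1000 delimited chunks) and the repetition count are assumptions made for this illustration, and `system.time` is a coarse stand-in for microbenchmark:

```r
# Assumed test input (not from the original post): 1000 chunks in one string.
big <- paste0("strt", seq_len(1000), "stpblabla", collapse = "")

hwnd <- function(x1) regmatches(x1, gregexpr('(?<=strt).*?(?=stp)', x1, perl = TRUE))
Tim  <- function(x1) regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl = TRUE))

# The two patterns should agree on this input before their timings are compared.
stopifnot(identical(hwnd(big), Tim(big)))

# Coarse timing over repeated calls; results will vary by machine and R version.
print(system.time(for (i in 1:20) hwnd(big)))
print(system.time(for (i in 1:20) Tim(big)))
```

The stringi and stringr variants can be added to the same loop once those packages are loaded.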

bartektartanus answered Oct 11 '22 11:10