 

How to extract multiple overlapping strings from a string using stringr?

Tags:

regex

r

stringr

I am running the following code:

str_extract_all("AAAAAAAAAAAAAAAXAAAAAAAAAXBAAAAAAAAA", ".{5}X.{5}")

but I only get one string back. However, if I rerun the same code with 4 characters on each side of the X, I get two strings as expected. So I understand the problem is that the two extracted strings would overlap at their ends (there are only 9 characters between the two X's, but the two matches together need 10). This behaviour does not seem to be documented in ?str_extract_all. Any suggestions for how I can get all the strings, even if their ends overlap?

Pavel Shliaha asked Oct 17 '25 08:10


2 Answers

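We can do that using a positive lookahead, since a lookahead does not consume any of the string when it matches. The pattern itself goes in a capture group inside the lookahead, and the second column of the match matrix holds that group.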

string <- "AAAAAAAAAAAAAAAXAAAAAAAAAXBAAAAAAAAA"
stringr::str_match_all(string, "(?=(.{5}X.{5}))")[[1]][, 2]
#[1] "AAAAAXAAAAA" "AAAAAXBAAAA"
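
If this comes up more than once, the same idea can be wrapped in a small helper. The sketch below is only an illustration: `extract_overlapping()` is a hypothetical name, not a stringr function. It relies on `str_match_all()` returning one matrix per input string, with the outer capture group in column 2.

```r
library(stringr)

# Hypothetical helper (not part of stringr): wrap `pattern` in a lookahead
# with one outer capture group so that matches are allowed to overlap.
extract_overlapping <- function(string, pattern) {
  lookahead <- paste0("(?=(", pattern, "))")
  # Column 1 is the (empty) full match; column 2 is the outer capture group.
  lapply(str_match_all(string, lookahead), function(m) m[, 2])
}

res <- extract_overlapping("AAAAAAAAAAAAAAAXAAAAAAAAAXBAAAAAAAAA", ".{5}X.{5}")
res[[1]]
#[1] "AAAAAXAAAAA" "AAAAAXBAAAA"
```

The result is a list with one character vector per input string, so it also works on a vector of strings.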
Ronak Shah answered Oct 19 '25 22:10


We can get around this unfortunate feature as follows.
Let's give the ugly string a name and find the positions of the X's:

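library(stringr)
library(purrr)  # needed for map_chr() further down
aax <- "AAAAAAAAAAAAAAAXAAAAAAAAAXBAAAAAAAAAX"
# Locate every X that has at least 5 characters before and after it
x.mtrx <- str_locate_all(aax, "(?x) (?<=.{5}) X (?=.{5})")[[1]]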

Since we're only passing one string, we only want the [[1]] element of the result, which is a matrix. [The (?x) flag turns on free-spacing mode, which lets me put spaces in my regex; it quickly becomes illegible otherwise.]

# R > x.mtrx
#      start end
# [1,]    16  16
# [2,]    26  26
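
For longer patterns, the same regex can also be spelled with stringr's `regex(..., comments = TRUE)`, which allows whitespace and `#` comments inside the pattern. This is a sketch of an equivalent spelling, not the code used above; the name `pattern` is introduced here only for illustration.

```r
library(stringr)

aax <- "AAAAAAAAAAAAAAAXAAAAAAAAAXBAAAAAAAAAX"

# Same lookbehind / X / lookahead pattern, one piece per line.
pattern <- regex("
  (?<=.{5})  # five characters before the X, checked but not consumed
  X          # the anchor character itself
  (?=.{5})   # five characters after the X, checked but not consumed
", comments = TRUE)

m <- str_locate_all(aax, pattern)[[1]]
m
```

Because neither lookaround consumes any characters, each match is just the single X, so neighbouring windows cannot block each other.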

Split the matrix into single rows (of start and end positions, which are the same for a single-character X), then use each row to extract the substring from aax.

split(x.mtrx, seq(nrow(x.mtrx))) %>% 
  map_chr(~ str_sub(aax, start = .x[1] - 5, end = .x[2] + 5) )

            1             2 
"AAAAAXAAAAA" "AAAAAXBAAAA" 

Notice that the terminal X wasn't captured, because it doesn't have 5 characters after it.
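
As a possible simplification, `str_sub()` is vectorised over `start` and `end`, so the split/map step can be replaced by a single call. A minimal sketch, assuming the same `aax` and locate pattern as above (`windows` is a name introduced here):

```r
library(stringr)

aax <- "AAAAAAAAAAAAAAAXAAAAAAAAAXBAAAAAAAAAX"
x.mtrx <- str_locate_all(aax, "(?x) (?<=.{5}) X (?=.{5})")[[1]]

# str_sub() recycles `aax` across the start/end vectors,
# giving one 11-character window per located X.
windows <- str_sub(aax,
                   start = x.mtrx[, "start"] - 5,
                   end   = x.mtrx[, "end"] + 5)
windows
#[1] "AAAAAXAAAAA" "AAAAAXBAAAA"
```

This version needs only stringr, with no purrr or pipe.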

David T answered Oct 19 '25 22:10


