Extract text between variable delimiters

Question

I have text with large numbers of special characters that I want to extract certain substrings from:

y <- c("some stuff <rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep> some stuff <#> <dir> where is Londonderry?</dir>",
       "some stuff <rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>",
       "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa> blah blub <icu> Yeah </icu>")

I want to extract whatever comes between tag-like substrings, e.g. <dir> ...</dir>, <rep> ...</rep>, <icu> ...</icu> and so on.

With this regex I'm modestly successful:

library(stringr)
lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>(?!<\\1>).*</\\1>")), collapse = ", "))
[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>"

[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"

Only [[2]] isn't as expected: it still contains unwanted material (namely <#> potentially more stuff), and the two <rep> ...</rep> substrings are not separated by a comma. My hunch is that my regex fails here because the two tags are the same rather than different.

How can the regex be improved to produce the result below?

Expected result:

[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"

[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"

EDIT:

I've found a viable solution in the meantime:

lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>.*?</\\1>")), collapse = ", "))
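
To see why the first attempt over-matched in [[2]], here is a minimal sketch in base R (regmatches/gregexpr with perl = TRUE instead of stringr; the short string x is made up for illustration). The hunch is right in that the problem only shows up when the same tag occurs twice, but the cause is greediness: .* runs to the last </rep> in the string, whereas the lazy .*? stops at the first one. The negative lookahead from the first attempt only rules out a tag that is immediately followed by the same tag, so it is left out here.

```r
# Made-up input with the same tag occurring twice
x <- "a <rep> one </rep> <#> b <rep> two </rep>"

# Greedy: .* consumes as much as possible before the closing tag
greedy <- regmatches(x, gregexpr("<([a-z]{3})>.*</\\1>", x, perl = TRUE))[[1]]

# Lazy: .*? stops at the first closing tag that matches the backreference
lazy <- regmatches(x, gregexpr("<([a-z]{3})>.*?</\\1>", x, perl = TRUE))[[1]]

greedy
# [1] "<rep> one </rep> <#> b <rep> two </rep>"
lazy
# [1] "<rep> one </rep>" "<rep> two </rep>"
```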

Dunois · Accepted Answer

How about this?

unlist(str_extract_all(y, "\\<([A-Za-z0-9_]+\\>).*?(\\<\\/\\1)"))

# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>" "<dir> where is Londonderry?</dir>"                         
# [3] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>"    "<rep> I 1lIved in Lisburn </rep>"                          
# [5] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>"    "<icu> Yeah </icu>"

Basically, all we're doing here is putting the opening tag's name (plus its trailing angle bracket) in a capture group, and reusing that capture group to define the closing tag as well. The lazy .*? then matches everything between the opening tag and the nearest closing tag built from that group. So something like <(tag>)whatever</\1, where \1 is tag>.
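
A small sketch of what the capture group actually holds, in base R rather than stringr and with a made-up input string x. regexec returns the full match followed by each capture group, so the second element is the text that \1 refers to:

```r
# Made-up input for illustration
x <- "some stuff <rep> I 1lIved in Lisburn </rep> more"

# Group 1 captures the tag name plus its closing angle bracket
pattern <- "<([A-Za-z0-9_]+>).*?</\\1"

# First match only: element 1 is the full match, element 2 is group 1
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

m[1]
# [1] "<rep> I 1lIved in Lisburn </rep>"
m[2]
# [1] "rep>"
```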

Edit:

I guess this should do it:

lapply(str_extract_all(y, "\\<([A-Za-z0-9]+)\\>.*?\\<\\/\\1\\>"), paste, collapse = ", ")

# [[1]]
# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

# [[2]]
# [1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"

# [[3]]
# [1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"

TarJae · Answer

library(gsubfn)
a1 <- strapplyc(y, "<dir>(.*?)</dir>", simplify = c)
a2 <- strapplyc(y, "<rep>(.*?)</rep>", simplify = c)
a3 <- strapplyc(y, "<icu>(.*?)</icu>", simplify = c)

a1
a2
a3

# output:
> a1
[1] " where is Londonderry?"
> a2
[1] " I 1knOw 2LondondErry is bigger than 2LIsburn% " " <[> But it 's 1nOt an overflow of Belfast% "   
[3] " I 1lIved in Lisburn "                          
> a3
[1] " Yeah "
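
If the tag names are not known in advance, the same idea can probably be generalised with a backreference instead of one call per tag. This is a rough sketch in base R, not gsubfn; inner_text is a hypothetical helper name, the input strings are made up, and it assumes three-letter lowercase tags as in the question.

```r
# Hypothetical helper: text inside any <xxx> ... </xxx> pair, tags stripped
inner_text <- function(x) {
  # Full tagged spans, one character vector per input string
  hits <- regmatches(x, gregexpr("<([a-z]{3})>.*?</\\1>", x, perl = TRUE))
  # Drop the opening and closing tag from each span
  lapply(hits, function(h) sub("^<[a-z]{3}>(.*)</[a-z]{3}>$", "\\1", h, perl = TRUE))
}

# Made-up input for illustration
x <- c("a <rep> one </rep> b <dir> two</dir>", "<icu> Yeah </icu>")
res <- inner_text(x)
res
# [[1]]
# [1] " one " " two"
#
# [[2]]
# [1] " Yeah "
```

Unlike the per-tag calls above, this keeps the results grouped by input string rather than by tag.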

Tags: regex, r

Chris Ruehlemann
