Extract text between variable delimiters

Tags: regex, r
I have text containing many special characters, from which I want to extract certain substrings:

y <- c("some stuff <rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep> some stuff <#> <dir> where is Londonderry?</dir>",
       "some stuff <rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>",
       "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa> blah blub <icu> Yeah </icu>")

I want to extract whatever comes between tag-like substrings (along with the tags themselves), such as <dir> ...</dir>, <rep> ...</rep>, <icu> ...</icu>, and so on.

With this regex I'm modestly successful:

library(stringr)
lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>(?!<\\1>).*</\\1>")), collapse = ", "))
[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>"

[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"

Only [[2]] isn't as expected: it still contains unwanted material (namely <#> potentially more stuff), and the two <rep> ...</rep> substrings are not separated by ", ". My hunch is that my regex fails here because the two tags are the same rather than different.

How can the regex be improved to produce this expected result?

[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"

[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"

EDIT:

I've found a viable solution in the meantime:

lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>.*?</\\1>")), collapse = ", "))
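The difference between the two patterns appears to come down to greedy versus lazy matching rather than to the tags being identical. Here is a minimal sketch, assuming a made-up input string `x` and using base R's `gregexpr()` with `perl = TRUE` in place of stringr so that it runs without extra packages; the lookahead from the first pattern is left out for brevity:

```r
# Illustrative input (not from the data above): two <rep> blocks with filler between them
x <- "<rep> first </rep> <#> filler <rep> second </rep>"

# Greedy: .* runs to the LAST </rep>, so both blocks and the filler form one match
greedy <- regmatches(x, gregexpr("<([a-z]{3})>.*</\\1>", x, perl = TRUE))[[1]]

# Lazy: .*? stops at the FIRST </rep> after each opening tag
lazy <- regmatches(x, gregexpr("<([a-z]{3})>.*?</\\1>", x, perl = TRUE))[[1]]

print(greedy)
print(lazy)
```

With the greedy quantifier the whole string comes back as a single match; with the lazy one the two `<rep>` blocks come back separately.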
Chris Ruehlemann asked May 09 '26 02:05


2 Answers

How about this?

unlist(str_extract_all(y, "\\<([A-Za-z0-9_]+\\>).*?(\\<\\/\\1)"))

# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>" "<dir> where is Londonderry?</dir>"                         
# [3] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>"    "<rep> I 1lIved in Lisburn </rep>"                          
# [5] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>"    "<icu> Yeah </icu>"     

Basically, all we're doing here is putting the opening tag's name (plus its trailing angle bracket) in a capture group and reusing that group, via a backreference, to define the closing tag as well. The lazy .*? then matches everything between the two. So the pattern has the shape <(tag>)whatever</\\1, where \\1 stands for tag>.
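To see what the backreference contributes on its own, here is a small sketch under stated assumptions: the input `x` is invented for illustration, and base R's `gregexpr()` with `perl = TRUE` stands in for `str_extract_all()`. A closing tag with a different name is skipped, because `\\1` only accepts the name that was captured by the opening tag:

```r
# Illustrative input: the first closing tag (</dir>) does not match the opening <rep>
x <- "<rep> one </dir> two </rep> <icu> three </icu>"

# ([a-z]+) captures the tag name; </\\1> requires the closing tag to repeat that name
pattern <- "<([a-z]+)>.*?</\\1>"

matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(matches)
# [1] "<rep> one </dir> two </rep>" "<icu> three </icu>"
```

The stray `</dir>` ends up inside the first match instead of terminating it, which is the behaviour the capture group is there to provide.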

Edit:

I guess this should do it:

lapply(str_extract_all(y, "\\<([A-Za-z0-9]+)\\>.*?\\<\\/\\1\\>"), paste, collapse = ", ")

# [[1]]
# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

# [[2]]
# [1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"

# [[3]]
# [1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
Dunois answered May 10 '26 17:05


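library(gsubfn)
# strapplyc() returns only the captured group, i.e. the text between the tags;
# simplify = c concatenates the per-string results into one character vector.
a1 <- strapplyc(y, "<dir>(.*?)</dir>", simplify = c)
a2 <- strapplyc(y, "<rep>(.*?)</rep>", simplify = c)
a3 <- strapplyc(y, "<icu>(.*?)</icu>", simplify = c)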

a1
a2
a3

# output:
> a1
[1] " where is Londonderry?"
> a2
[1] " I 1knOw 2LondondErry is bigger than 2LIsburn% " " <[> But it 's 1nOt an overflow of Belfast% "   
[3] " I 1lIved in Lisburn "                          
> a3
[1] " Yeah "
TarJae answered May 10 '26 16:05


