
Extract capture group matches from regular expressions? (or: where is gregexec?)

Given a regular expression containing capture groups (parentheses) and a string, how can I obtain all the substrings matching the capture groups, i.e., the substrings usually referenced by "\1", "\2"?

Example: consider a regex capturing digits preceded by "xy":

s <- "xy1234wz98xy567"

r <- "xy(\\d+)"

Desired result:

[1] "1234" "567" 

First attempt: gregexpr:

regmatches(s,gregexpr(r,s))
#[[1]]
#[1] "xy1234" "xy567" 

Not what I want because it returns the substrings matching the entire pattern.

Second try: regexec:

regmatches(s,regexec("xy(\\d+)",s))
#[[1]]
#[1] "xy1234" "1234" 

Not what I want either, because it returns only the first occurrence of a match for the entire pattern, together with its capture group.

If there were a gregexec function, extending regexec the way gregexpr extends regexpr, my problem would be solved.

So the question is: how to retrieve all substrings (or indices that can be passed to regmatches as in the examples above) matching capture groups in an arbitrary regular expression?

Note: the pattern for r given above is just a silly example; the solution must work for an arbitrary pattern.

Ferdinand.kraft asked Sep 04 '13 17:09


2 Answers

For a base R solution, what about just using gsub() to finish processing the strings extracted by gregexpr() and regmatches()?

s <- "xy1234wz98xy567"
r <- "xy(\\d+)"

gsub(r, "\\1", regmatches(s,gregexpr(r,s))[[1]])
# [1] "1234" "567" 
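
As a side note on the "where is gregexec?" part of the question: base R has shipped a gregexec() function since version 4.1.0, so on a recent installation the following sketch should work without the gsub() step. It assumes R >= 4.1.0 and reuses s and r from above; the variable names m and groups are only illustrative.

```r
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"

# gregexec() (assumes R >= 4.1.0) reports every match together with its
# capture groups. regmatches() then returns one character matrix per input
# string: one column per match, first row = whole match, following rows =
# capture groups.
m <- regmatches(s, gregexec(r, s))[[1]]

# The second row holds the first capture group of each match.
groups <- m[2, ]
groups
# values: "1234" "567"
```

With several capture groups in the pattern, each additional group would occupy one more row of m.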
Josh O'Brien answered Oct 08 '22 17:10


Not sure about doing this in base, but here's a package for your needs:

library(stringr)

str_match_all(s, r)
#[[1]]
#     [,1]     [,2]  
#[1,] "xy1234" "1234"
#[2,] "xy567"  "567" 

Many stringr functions also have parallels in base R, so you can also achieve this without using stringr.

For instance, here's a simplified version of how the above works, using base R:

sapply(regmatches(s,gregexpr(r,s))[[1]], function(m) regmatches(m,regexec(r,m)))
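
The same two-step idea can also be written so that the result has the matrix shape of str_match_all(). This is only a sketch, not how stringr works internally; it relies on regexec() being vectorised over its text argument, and the names full, parts and m are illustrative.

```r
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"

# Step 1: every substring matching the whole pattern.
full <- regmatches(s, gregexpr(r, s))[[1]]

# Step 2: regexec() accepts a character vector, so a single call splits
# each full match into the match itself plus its capture groups.
parts <- regmatches(full, regexec(r, full))

# Step 3: stack the pieces into a matrix with one row per match.
# Column 1 is the whole match, column 2 the first capture group.
m <- do.call(rbind, parts)

m[, 2]
# values: "1234" "567"
```

One caveat: step 2 re-applies the pattern to each isolated match, so patterns whose behaviour depends on the surrounding text (for example lookbehind assertions) may not give the same groups as in the original string.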
eddi answered Oct 08 '22 17:10