
Regex group capture in R with multiple capture-groups

In R, is it possible to extract capture groups from a regular expression match? As far as I can tell, none of grep, grepl, regexpr, gregexpr, sub, or gsub return the captured groups.

I need to extract key-value pairs from strings that are encoded thus:

\((.*?) :: (0\.[0-9]+)\) 

I can always do multiple full-match greps, or do some outside (non-R) processing, but I was hoping to do it all within R. Is there a function, or a package that provides one, to do this?

Daniel Dickison asked Jun 04 '09 18:06



2 Answers

str_match(), from the stringr package, will do this. It returns a character matrix with one column for each group in the match (and one for the whole match):

> library(stringr)
> s = c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
> str_match(s, "\\((.*?) :: (0\\.[0-9]+)\\)")
     [,1]                         [,2]       [,3]
[1,] "(sometext :: 0.1231313213)" "sometext" "0.1231313213"
[2,] "(moretext :: 0.111222)"     "moretext" "0.111222"
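If adding a package is not an option, a similar matrix can probably be built in base R with regexec() and regmatches(). The following is only a sketch under that assumption, not part of the original answer; the variable names (pattern, groups, values) are illustrative, and it assumes every input string matches the pattern.

```r
s <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# regexec() records the position of the full match and of each group;
# regmatches() turns those positions into a list of character vectors,
# one vector per input string.
m <- regmatches(s, regexec(pattern, s))

# Bind the per-string vectors into a matrix shaped like str_match() output:
# column 1 is the whole match, columns 2 and 3 are the capture groups.
# Assumption: every string matched, so every vector has length 3.
groups <- do.call(rbind, m)

# Name the numeric values by their keys to get key-value pairs.
values <- setNames(as.numeric(groups[, 3]), groups[, 2])
print(values)
```

A string that does not match yields an empty vector in m, so real data would need a check before the rbind step.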
Kent Johnson answered Sep 19 '22 13:09


gsub does this. Using your example:

> gsub("\\((.*?) :: (0\\.[0-9]+)\\)", "\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"

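You need to double every backslash inside the quoted string: R's string parser consumes one level of escaping, so "\\(" reaches the regex engine as \(. The same applies to the backreferences \\1 and \\2 in the replacement.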
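To see the escaping at work, the sketch below prints the pattern as the regex engine receives it and pulls the key and the value out separately with sub(). The variable names are illustrative, and the raw-string form is an assumption about your setup: it needs R 4.0.0 or later.

```r
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# The literal "\\(" holds two characters: one backslash and "(".
n_chars <- nchar("\\(")

# writeLines() shows the string without R's escaping, i.e. the regex itself.
writeLines(pattern)

# Raw strings (R >= 4.0.0 only) avoid the doubling altogether.
raw_pattern <- r"{\((.*?) :: (0\.[0-9]+)\)}"
same <- identical(pattern, raw_pattern)

# One sub() call per group gives the key and the value separately.
x <- "(sometext :: 0.1231313213)"
key <- sub(pattern, "\\1", x)
value <- as.numeric(sub(pattern, "\\2", x))
```

Calling sub() once per group gets clumsy with many groups; a matrix-returning approach such as str_match() scales better.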

Hope this helps.

David Lawrence Miller answered Sep 19 '22 13:09