
What R function to use for regex capture groups?

Tags: regex, r

I am doing some text wrangling in R, and for a specific extraction I need to use a capture group. For some reason the base/stringr functions I am familiar with don't seem to support capture groups:

str_extract("abcd123asdc", pattern = "([0-9]{3}).+$") 
# Returns: "123asdc"

stri_extract(str = "abcd123asdc", regex = "([0-9]{3}).+$")
# Returns: "123asdc"

grep(x = "abcd123asdc", pattern = "([0-9]{3}).+$", value = TRUE)
# Returns: "abcd123asdc"

The usual googling for "R capture group regex" doesn't give any useful hits for solutions to this problem. Am I missing something, or are capture groups not implemented in R?

EDIT: After trying the solution suggested in the comments, which works on a small example, I found that it fails for my situation.

Note that this text is from the Enron emails dataset, so it doesn't contain sensitive information.

txt <- "Message-ID: <24216240.1075855687451.JavaMail.evans@thyme>
Date: Wed, 18 Oct 2000 03:00:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Re: test
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Leah Van Arsdall
X-cc: 
X-bcc: 
X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\sent mail   
X-Origin: Allen-P
X-FileName: pallen.nsf

test successful.  way to go!!!"

sub("X-FileName:.+\n\n([\\W\\w]+)$", "\\1", txt)
# Returns all of "txt", not the capture group

Since we only have a single capture group, shouldn't the "\1" capture it? I tested the regex with an online regex tester and it should be working. I also tried both \n and \\n for the newlines. Any ideas?

asked May 14 '17 20:05 by BallzofFury




1 Answer

Getting the job done

You may always extract capture groups with stringr using str_match or str_match_all:

> result <- str_match(txt, "X-FileName:.+\n\n(?s)(.+)$")
> result[,2]
[1] "test successful.  way to go!!!"

Pattern details:

  • X-FileName: - a literal substring
  • .+ - any 1+ chars other than line break (since in ICU regex, a dot does not match a line break char)
  • \n\n - 2 newline symbols
  • (?s) - an inline DOTALL modifier (now, . that occurs to the right will match a line break char)
  • (.+) - Group 1 capturing any 1+ chars (incl. line breaks) up to
  • $ - the end of string.
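The effect of the inline (?s) modifier can be checked on a tiny input. The sketch below is an illustration only: it uses base R with perl = TRUE (PCRE) rather than the ICU engine behind stringr, on the assumption that both treat the dot and (?s) the same way for this case, and the two_lines string is a made-up example.

```r
# Hypothetical two-line input, used only for illustration.
two_lines <- "first\nsecond"

# Without (?s): "." stops at the line break, so the match ends on line one.
no_dotall <- regmatches(two_lines, regexpr("s.+", two_lines, perl = TRUE))

# With (?s): "." also matches "\n", so the match runs to the end of the string.
dotall <- regmatches(two_lines, regexpr("(?s)s.+", two_lines, perl = TRUE))

print(no_dotall)  # [1] "st"
print(dotall)     # [1] "st\nsecond"
```

This is why the pattern above places (?s) only before the capture group: the earlier .+ must stay on the X-FileName line, while the group has to span the whole message body.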

Or you may use base R regmatches with regexec:

> result <- regmatches(txt, regexec("X-FileName:[^\n]+\n\n(.+)$", txt))
> result[[1]][2]
[1] "test successful.  way to go!!!"

See the online R demo. Here, a TRE regex is used (older versions of regexec do not accept perl = TRUE, though recent R releases do), so . matches any character, including a line break char. That is why the pattern uses [^\n]+ rather than .+ after the header name: X-FileName:[^\n]+\n\n(.+)$:

  • X-FileName: - a literal string
  • [^\n]+ - 1+ chars other than newline
  • \n\n - 2 newlines
  • (.+) - any 1+ chars (including line break chars), as many as possible, up to
  • $ - the end of string.
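To see what regmatches with regexec returns, here is a minimal sketch on the short string from the question. The second capture group is added only for illustration and is not part of the original pattern.

```r
x <- "abcd123asdc"

# regexec() reports the full match plus one entry per capture group;
# regmatches() turns those positions into substrings (one list item per input).
m <- regmatches(x, regexec("([0-9]{3})(.+)$", x))
parts <- m[[1]]

full_match <- parts[1]  # whole match
digits     <- parts[2]  # capture group 1
rest       <- parts[3]  # capture group 2

print(parts)  # [1] "123asdc" "123"     "asdc"
```

Element 1 is always the whole match, so group n sits at index n + 1, which is why the answer reads result[[1]][2] to get the first group.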

A sub option can also be considered:

sub(".*X-FileName:[^\n]+\n\n", "", txt)
[1] "test successful.  way to go!!!"

See this R demo. Here, .* matches any 0+ chars, as many as possible (all the string), then backtracks to find X-FileName: substring, [^\n]+ matches 1+ chars other than a newline, and then \n\n match 2 newlines.
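The same result can be reached with a capture group and a backreference, provided the pattern consumes the whole string so that nothing outside the group is left behind. The sketch below assumes a shortened, made-up message (msg) rather than the full txt from the question, and also shows that sub returns its input unchanged when the pattern does not match at all.

```r
# Shortened, hypothetical message used only for illustration.
msg <- "X-Origin: Allen-P\nX-FileName: pallen.nsf\n\ntest successful.  way to go!!!"

# Option 1: delete everything up to and including the blank line.
body_removed <- sub(".*X-FileName:[^\n]+\n\n", "", msg)

# Option 2: match the entire string and keep only capture group 1.
body_captured <- sub("^.*X-FileName:[^\n]+\n\n(.+)$", "\\1", msg)

# If the pattern never matches, sub() hands back the original string.
unchanged <- sub("no-such-header:(.+)$", "\\1", msg)

print(body_removed)                  # [1] "test successful.  way to go!!!"
print(identical(body_removed, body_captured))  # [1] TRUE
print(identical(unchanged, msg))     # [1] TRUE
```

Getting the full input back from sub, as in the question, is therefore usually a sign that the pattern did not match rather than that the backreference failed.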

Comparing performance

Taking into account hwnd's comment, I added a TRE regex based sub option above. It seems to be the fastest of the 4 options benchmarked (f3 is a PCRE variant of the same sub call), with str_match being almost as fast as my sub code above:

library(microbenchmark)
library(stringr)  # needed for str_match

# Each function works on its own argument (text), not on the global txt
f1 <- function(text) { return(str_match(text, "X-FileName:.+\n\n(?s)(.+)$")[,2]) }
f2 <- function(text) { return(regmatches(text, regexec("X-FileName:[^\n]+\n\n(.+)$", text))[[1]][2]) }
f3 <- function(text) { return(sub('(?s).*X-FileName:[^\n]+\\R+', '', text, perl=TRUE)) }
f4 <- function(text) { return(sub('.*X-FileName:[^\n]+\n\n', '', text)) }

> test <- microbenchmark( f1(txt), f2(txt), f3(txt), f4(txt), times = 500000 )
> test
Unit: microseconds
    expr    min     lq     mean median     uq       max neval  cld
 f1(txt) 21.130 24.451 28.08150 27.168 28.677 53796.565 5e+05  b  
 f2(txt) 29.280 32.903 37.46800 35.318 37.431 54556.635 5e+05   c 
 f3(txt) 57.655 59.466 63.36906 60.674 61.881  1651.448 5e+05    d
 f4(txt) 22.036 23.545 25.56820 24.451 25.356  1660.504 5e+05 a   
answered Nov 12 '22 09:11 by Wiktor Stribiżew
