What R function to use for regex capture groups?

Tags:

r

I am doing some text wrangling in R, and for a specific extraction I need to use a capture group. For some reason the base/stringr functions I am familiar with don't seem to support capture groups:

str_extract("abcd123asdc", pattern = "([0-9]{3}).+$") 
# Returns: "123asdc"

stri_extract(str = "abcd123asdc", regex = "([0-9]{3}).+$")
# Returns: "123asdc"

grep(x = "abcd123asdc", pattern = "([0-9]{3}).+$", value = TRUE)
# Returns: "abcd123asdc"

The usual googling for "R capture group regex" doesn't give any useful hits for solutions to this problem. Am I missing something, or are capture groups not implemented in R?

EDIT: After trying the solution suggested in the comments, which works on a small example, I found that it fails for my situation.

Note that this text is from the Enron email dataset, so it doesn't contain sensitive information.

txt <- "Message-ID: <24216240.1075855687451.JavaMail.evans@thyme>
Date: Wed, 18 Oct 2000 03:00:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Re: test
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Leah Van Arsdall
X-cc: 
X-bcc: 
X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\sent mail   
X-Origin: Allen-P
X-FileName: pallen.nsf

test successful.  way to go!!!"

sub("X-FileName:.+\n\n([\\W\\w]+)$", "\\1", txt)
# Returns all of "txt", not the capture group

Since we only have a single capture group, shouldn't "\\1" refer to it? I tested the regex with an online regex tester and it should be working. I also tried both \\n and \n for the newlines. Any ideas?


asked May 14 '17 20:05

BallzofFury

1 Answer

Getting the job done

You may always extract capture groups with stringr using str_match or str_match_all:

> result <- str_match(txt, "X-FileName:.+\n\n(?s)(.+)$")
> result[,2]
[1] "test successful.  way to go!!!"

Pattern details:

X-FileName: - a literal substring
.+ - any 1+ chars other than line break (since in ICU regex, a dot does not match a line break char)
\n\n - 2 newline symbols
(?s) - an inline DOTALL modifier (now, . that occurs to the right will match a line break char)
(.+) - Group 1 capturing any 1+ chars (incl. line breaks) up to
$ - the end of string.
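
For the short string from the question, the same approach can be sketched as follows. This assumes the stringr package is installed; the variable names (x, m, all_m) are only illustrative. str_match returns a character matrix whose first column is the whole match and whose later columns are the capture groups, while str_match_all returns one such matrix per input string.

```r
library(stringr)

x <- "abcd123asdc"

# Column 1 holds the full match, column 2 holds capture group 1
m <- str_match(x, "([0-9]{3}).+$")
group1 <- m[, 2]
print(group1)

# str_match_all returns a list with one matrix per input string,
# one row per match found in that string
all_m <- str_match_all("a1b22", "([a-z])([0-9]+)")
digits <- all_m[[1]][, 3]
print(digits)
```

Indexing the matrix by column keeps the code the same whether x holds one string or many.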

Or you may use base R regmatches with regexec:

> result <- regmatches(txt, regexec("X-FileName:[^\n]+\n\n(.+)$", txt))
> result[[1]][2]
[1] "test successful.  way to go!!!"

See the online R demo. Here, a TRE regex is used (the regexec default; current R versions also accept perl = TRUE), so . matches any character, including a line break char. The pattern X-FileName:[^\n]+\n\n(.+)$ breaks down as follows:

X-FileName: - a literal string
[^\n]+ - 1+ chars other than newline
\n\n - 2 newlines
(.+) - any 1+ chars (including line break chars), as many as possible, up to
$ - the end of string.
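
As a minimal base R sketch of the same idea on several strings at once: regmatches with regexec returns a list with one element per input, holding the full match followed by the capture groups, or character(0) when nothing matched. The input vector and the helper name first_group are made up for illustration.

```r
x <- c("abcd123asdc", "no digits here", "zz987yy")

# One list element per input string: full match first, then the groups
hits <- regmatches(x, regexec("([0-9]{3}).+$", x))

# Pull out capture group 1, using NA where the pattern did not match
first_group <- vapply(
  hits,
  function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
  character(1)
)
print(first_group)
```

The length check is there because a non-matching string yields an empty vector, and indexing it directly would silently produce NA only by accident of R's indexing rules.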

A sub option can also be considered:

sub(".*X-FileName:[^\n]+\n\n", "", txt)
[1] "test successful.  way to go!!!"

See this R demo. Here, .* matches any 0+ chars, as many as possible (all the string), then backtracks to find X-FileName: substring, [^\n]+ matches 1+ chars other than a newline, and then \n\n match 2 newlines.
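
The role of the leading .* is easier to see with a backreference. sub only replaces the part of the string that the pattern matched, so a pattern that starts at X-FileName: leaves everything before it in place. The sketch below uses a shortened, made-up header (msg is a hypothetical stand-in for txt) and the default TRE engine.

```r
msg <- "X-Origin: Allen-P\nX-FileName: pallen.nsf\n\ntest successful.  way to go!!!"

# No leading .*: the match starts at "X-FileName:", so the text before it survives
partial <- sub("X-FileName:[^\n]+\n\n(.+)$", "\\1", msg)

# Leading .*: the whole string is matched and replaced by capture group 1
whole <- sub(".*X-FileName:[^\n]+\n\n(.+)$", "\\1", msg)

print(partial)
print(whole)
```

When the pattern does not match at all, sub returns its input unchanged, which is worth checking first if the full text comes back.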

Comparing performance

Taking into account hwnd's comment, I added the TRE-based sub option above. It seems to be the fastest of the 4 options suggested, with str_match almost as fast:

library(stringr)
library(microbenchmark)

# Each function extracts the message body from its `text` argument
f1 <- function(text) str_match(text, "X-FileName:.+\n\n(?s)(.+)$")[, 2]
f2 <- function(text) regmatches(text, regexec("X-FileName:[^\n]+\n\n(.+)$", text))[[1]][2]
f3 <- function(text) sub('(?s).*X-FileName:[^\n]+\\R+', '', text, perl = TRUE)
f4 <- function(text) sub('.*X-FileName:[^\n]+\n\n', '', text)

> test <- microbenchmark( f1(txt), f2(txt), f3(txt), f4(txt), times = 500000 )
> test
Unit: microseconds
    expr    min     lq     mean median     uq       max neval  cld
 f1(txt) 21.130 24.451 28.08150 27.168 28.677 53796.565 5e+05  b  
 f2(txt) 29.280 32.903 37.46800 35.318 37.431 54556.635 5e+05   c 
 f3(txt) 57.655 59.466 63.36906 60.674 61.881  1651.448 5e+05    d
 f4(txt) 22.036 23.545 25.56820 24.451 25.356  1660.504 5e+05 a
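
Timings are only comparable if the candidates return the same value. A quick base R check of the three base variants might look like the following; msg is a shortened, made-up stand-in for txt, and the variable names are illustrative.

```r
msg <- "X-Origin: Allen-P\nX-FileName: pallen.nsf\n\ntest successful.  way to go!!!"

# regmatches + regexec (TRE), keeping capture group 1
via_regexec <- regmatches(msg, regexec("X-FileName:[^\n]+\n\n(.+)$", msg))[[1]][2]

# sub with PCRE: (?s) lets . match line breaks, \R matches a line break sequence
via_pcre_sub <- sub("(?s).*X-FileName:[^\n]+\\R+", "", msg, perl = TRUE)

# sub with the default TRE engine
via_tre_sub <- sub(".*X-FileName:[^\n]+\n\n", "", msg)

stopifnot(
  identical(via_regexec, via_tre_sub),
  identical(via_pcre_sub, via_tre_sub)
)
print(via_tre_sub)
```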

answered Nov 12 '22 09:11

Wiktor Stribiżew
