
How do you capture a group with regex?

Tags: c, regex, posix

I'm trying to extract a substring from a string using regex. I'm using the POSIX regex functions (regcomp, regexec, ...), but I can't manage to capture a group.

For instance, let the pattern be something as simple as "MAIL FROM:<(.*)>"
(with REG_EXTENDED cflags)

I want to capture everything between '<' and '>'.

My problem is that regmatch_t gives me the boundaries of the whole pattern (MAIL FROM:<...>) instead of just what's between the parentheses.

What am I missing?

Thanks in advance,

edit: some code

    #include <regex.h>
    #include <stdio.h>

    #define SENDER_REGEX "MAIL FROM:<(.*)>"

    int main(int ac, char **av)
    {
      regex_t regex;
      int status;
      regmatch_t pmatch[1];

      if (regcomp(&regex, SENDER_REGEX, REG_ICASE|REG_EXTENDED) != 0)
        printf("regcomp error\n");
      status = regexec(&regex, av[1], 1, pmatch, 0);
      regfree(&regex);
      if (!status)
          printf(  "matched from %d (%c) to %d (%c)\n"
                 , pmatch[0].rm_so
                 , av[1][pmatch[0].rm_so]
                 , pmatch[0].rm_eo
                 , av[1][pmatch[0].rm_eo]
                );

      return (0);
    }

outputs:

    $ ./a.out "012345MAIL FROM:<abcd>$"
    matched from 6 (M) to 22 ($)

solution:

As RarrRarrRarr said, the indices of the group are indeed in pmatch[1].rm_so and pmatch[1].rm_eo, so the array needs two elements and regexec has to be told to fill both:
regmatch_t pmatch[1]; becomes regmatch_t pmatch[2];
and regexec(&regex, av[1], 1, pmatch, 0); becomes regexec(&regex, av[1], 2, pmatch, 0);

Thanks :)

Sylvain asked Apr 05 '10 06:04



2 Answers

Here's a code example that demonstrates capturing multiple groups.

You can see that group '0' is the whole match, and subsequent groups are the parts within parentheses.

Note that this will only capture the first match in the source string. Capturing the groups of every match requires calling regexec again on the remainder of the string after each match.

    #include <stdio.h>
    #include <string.h>
    #include <regex.h>

    int main (void)
    {
      const char * source = "___ abc123def ___ ghi456 ___";
      const char * regexString = "[a-z]*([0-9]+)([a-z]*)";
      size_t maxGroups = 3;  // Whole match plus two parenthesized groups

      regex_t regexCompiled;
      regmatch_t groupArray[maxGroups];

      if (regcomp(&regexCompiled, regexString, REG_EXTENDED))
        {
          printf("Could not compile regular expression.\n");
          return 1;
        }

      if (regexec(&regexCompiled, source, maxGroups, groupArray, 0) == 0)
        {
          unsigned int g = 0;
          for (g = 0; g < maxGroups; g++)
            {
              // regoff_t is signed; -1 marks a group that did not take part
              if (groupArray[g].rm_so == -1)
                break;  // No more groups

              // Copy the source and cut it at the end of the group so that
              // the group can be printed as a NUL-terminated string
              char sourceCopy[strlen(source) + 1];
              strcpy(sourceCopy, source);
              sourceCopy[groupArray[g].rm_eo] = 0;
              printf("Group %u: [%2d-%2d]: %s\n",
                     g, (int)groupArray[g].rm_so, (int)groupArray[g].rm_eo,
                     sourceCopy + groupArray[g].rm_so);
            }
        }

      regfree(&regexCompiled);

      return 0;
    }

Output:

    Group 0: [ 4-13]: abc123def
    Group 1: [ 7-10]: 123
    Group 2: [10-13]: def
Ian Mackinnon answered Oct 02 '22 07:10


The 0th element of the pmatch array of regmatch_t structs will contain the boundaries of the whole string matched, as you have noticed. In your example, you are interested in the regmatch_t at index 1, not at index 0, in order to get information about the string matched by the subexpression.

If you need more help, try editing your question to include an actual small code sample so that people can more easily spot the problem.

RarrRarrRarr answered Oct 02 '22 06:10