Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C Regular Expressions: Extracting the Actual Matches

I am using regular expressions in C (using the "regex.h" library). After setting up the standard calls (and checks) for regcomp(...) and regexec(...), I can only manage to print the actual substrings that match my compiled regular expression. Using regexec, according to the manual pages, means you store the substring matches in a structure known as "regmatch_t". The struct only contains rm_so and rm_eo to reference what I understand to be the addresses of the characters of the matched substring in memory, but my question is how can I just use these to offsets and two pointers to extract the actual substring and store it into an array (ideally a 2D array of strings)?

It works when you just print to standard out, but whenever you try to use the same setup but store it in a string/character array, it stores the entire string that was originally used to match against the expression. Further, what is the "%.*s" inside the print statement? I imagine it's a regular expression in of itself to read in the pointers to a character array correctly. I just want to store the matched substrings inside a collection so I can work with them elsewhere in my software.

Background: p and p2 are both pointers set to point to the start of string to match before entering the while loop in the code below: [EDIT: "matches" is a 2D array meant to ultimately store the substring matches and was preallocated/initalized before the main loop you see below]

int ind = 0;
while(1){
    regExErr1 = regexec(&r, p, 10, m, 0);
    //printf("Did match regular expr, value %i\n", regExErr1);
    if( regExErr1 != 0 ){ 
        fprintf(stderr, "No more matches with the inherent regular expression!\n"); 
        break; 
    }   
    printf("What was found was: ");
    int i = 0;
    while(1){
        if(m[i].rm_so == -1){
            break;
        }
        int start = m[i].rm_so + (p - p2);
        int finish = m[i].rm_eo + (p - p2);
        strcpy(matches[ind], ("%.*s\n", (finish - start), p2 + start));
        printf("Storing:  %.*s", matches[ind]);
        ind++;
        printf("%.*s\n", (finish - start), p2 + start);
        i++;
    }
    p += m[0].rm_eo; // this will move the pointer p to the end of last matched pattern and on to the start of a new one
}
printf("We have in [0]:  %s\n", temp);
like image 286
9codeMan9 Avatar asked Mar 06 '13 03:03

9codeMan9


People also ask

How do you find the number of matches in a regular expression?

To count a regex pattern multiple times in a given string, use the method len(re. findall(pattern, string)) that returns the number of matching substrings or len([*re. finditer(pattern, text)]) that unpacks all matching substrings into a list and returns the length of it as well.

What is difference [] and () in regex?

[] denotes a character class. () denotes a capturing group. [a-z0-9] -- One character that is in the range of a-z OR 0-9.

What will the regular expression match?

By default, regular expressions will match any part of a string. It's often useful to anchor the regular expression so that it matches from the start or end of the string: ^ matches the start of string. $ matches the end of the string.

How does regex work in C?

A regular expression is a sequence of characters that is used to search pattern. It is mainly used for pattern matching with strings, or string matching, etc. They are a generalized way to match patterns with sequences of characters. It is used in every programming language like C++, Java, and Python.


1 Answers

There are quite a lot of regular expression packages, but yours seems to match the one in POSIX: regcomp() etc.

The two structures it defines in <regex.h> are:

  • regex_t containing at least size_t re_nsub, the number of parenthesized subexpressions.

  • regmatch_t containing at least regoff_t rm_so, the byte offset from start of string to start of substring, and regoff_t rm_eo, the byte offset from start of string of the first character after the end of substring.

Note that 'offsets' are not pointers but indexes into the character array.

The execution function is:

  • int regexec(const regex_t *restrict preg, const char *restrict string, size_t nmatch, regmatch_t pmatch[restrict], int eflags);

Your printing code should be:

for (int i = 0; i <= r.re_nsub; i++)
{
    int start = m[i].rm_so;
    int finish = m[i].rm_eo;
//  strcpy(matches[ind], ("%.*s\n", (finish - start), p + start));  // Based on question
    sprintf(matches[ind], "%.*s\n", (finish - start), p + start);   // More plausible code
    printf("Storing:  %.*s\n", (finish - start), matches[ind]);     // Print once
    ind++;
    printf("%.*s\n", (finish - start), p + start);                  // Why print twice?
}

Note that the code should be upgraded to ensure that the string copy (via sprintf()) does not overflow the target string — maybe by using snprintf() instead of sprintf(). It is also a good idea to mark the start and end of a string in the printing. For example:

    printf("<<%.*s>>\n", (finish - start), p + start);

This makes it a whole heap easier to see spaces etc.

[In future, please attempt to provide an MCVE (Minimal, Complete, Verifiable Example) or SSCCE (Short, Self-Contained, Correct Example) so that people can help more easily.]

This is an SSCCE that I created, probably in response to another SO question in 2010. It is one of a number of programs I keep that I call 'vignettes'; little programs that show the essence of some feature (such as POSIX regexes, in this case). I find them useful as memory joggers.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <regex.h>

#define tofind    "^DAEMONS=\\(([^)]*)\\)[ \t]*$"

int main(int argc, char **argv)
{
    FILE *fp;
    char line[1024];
    int retval = 0;
    regex_t re;
    regmatch_t rm[2];
    //this file has this line "DAEMONS=(sysklogd network sshd !netfs !crond)"
    const char *filename = "/etc/rc.conf";

    if (argc > 1)
        filename = argv[1];

    if (regcomp(&re, tofind, REG_EXTENDED) != 0)
    {
        fprintf(stderr, "Failed to compile regex '%s'\n", tofind);
        return EXIT_FAILURE;
    }
    printf("Regex: %s\n", tofind);
    printf("Number of captured expressions: %zu\n", re.re_nsub);

    fp = fopen(filename, "r");
    if (fp == 0)
    {
        fprintf(stderr, "Failed to open file %s (%d: %s)\n", filename, errno, strerror(errno));
        return EXIT_FAILURE;
    }

    while ((fgets(line, 1024, fp)) != NULL)
    {
        line[strcspn(line, "\n")] = '\0';
        if ((retval = regexec(&re, line, 2, rm, 0)) == 0)
        {
            printf("<<%s>>\n", line);
            // Complete match
            printf("Line: <<%.*s>>\n", (int)(rm[0].rm_eo - rm[0].rm_so), line + rm[0].rm_so);
            // Match captured in (...) - the \( and \) match literal parenthesis
            printf("Text: <<%.*s>>\n", (int)(rm[1].rm_eo - rm[1].rm_so), line + rm[1].rm_so);
            char *src = line + rm[1].rm_so;
            char *end = line + rm[1].rm_eo;
            while (src < end)
            {
                size_t len = strcspn(src, " ");
                if (src + len > end)
                    len = end - src;
                printf("Name: <<%.*s>>\n", (int)len, src);
                src += len;
                src += strspn(src, " ");
            }
        }
    } 
    return EXIT_SUCCESS;
}

This was designed to find a particular line starting DAEMONS= in a file /etc/rc.conf (but you can specify an alternative file name on the command line). You can adapt it to your purposes easily enough.

like image 141
Jonathan Leffler Avatar answered Sep 28 '22 15:09

Jonathan Leffler