
When using regex in C, \d does not work but [0-9] does

Tags: c, regex

I do not understand why the regex pattern containing the \d character class does not work but [0-9] does. Character classes, such as \s (whitespace characters) and \w (word characters), do work. My compiler is gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3. I am using the C regular expression library.

Why doesn't \d work?

Text string:

const char *text = "148  apples    5 oranges";

For the above text string, this regex does not match:

const char *rstr = "^\\d+\\s+\\w+\\s+\\d+\\s+\\w+$";

This regex matches when using [0-9] instead of \d:

const char *rstr = "^[0-9]+\\s+\\w+\\s+[0-9]+\\s+\\w+$";



#include <stdio.h>
#include <stdlib.h>
#include <regex.h>

#define N_MATCHES  30

//   output from gcc --version: gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
//   compile command used:  gcc -o tstc_regex tstc_regex.c

const char *text = "148  apples    5 oranges";
  const char *rstr = "^[0-9]+\\s+\\w+\\s+[0-9]+\\s+\\w+$";    // finds match
//const char *rstr = "^\\d+\\s+\\w+\\s+\\d+\\s+\\w+$";        // does not find match

int main(int argc, char **argv)
{
    regex_t      rgx;
    regmatch_t   matches[N_MATCHES];
    int          status;

    // compile as a POSIX extended regex; a nonzero status is an error code
    status = regcomp(&rgx, rstr, REG_EXTENDED | REG_NEWLINE);
    if (status != 0) {
        fprintf(stdout, "regcomp error: %d\n", status);
        return 1;
    }
    status = regexec(&rgx, text, N_MATCHES, matches, 0);
    regfree(&rgx);    // release the compiled pattern on every path
    if (status == REG_NOMATCH) {
        fprintf(stdout, "regexec result: REG_NOMATCH (%d)\n", status);
    }
    else if (status != 0) {
        fprintf(stdout, "regexec error: %d\n", status);
        return 1;
    }
    else {
        fprintf(stdout, "regexec match found: %d\n", status);
    }
    return 0;
}
piedog asked Dec 15 '22 05:12


2 Answers

The regex flavor you're using is GNU ERE, which is similar to POSIX ERE but adds a few extra features. Among these are the character class shorthands \s, \S, \w and \W. The shorthands \d and \D are not among them, so use [0-9] or [[:digit:]] instead.

Alan Moore answered Dec 23 '22 20:12


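Neither pattern is likely to match in a strictly POSIX environment, because \s and \w are extensions too. To make the pattern truly POSIX compatible, use only bracket expressions:

const char *rstr = "^[[:digit:]]+[[:space:]]+[[:alpha:]]+[[:space:]]+[[:digit:]]+[[:space:]]+[[:alpha:]]+$";

Note that [[:alpha:]] is narrower than \w, which also matches digits and the underscore; use [[:alnum:]_] if you need an exact equivalent.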

See also: POSIX character classes

l'L'l answered Dec 23 '22 22:12