
Different behavior in C regex vs C++ regex using extended POSIX grammar

I am seeing different results when using the C POSIX regex library and the C++ standard library implementation. Here is my code:

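#include <iostream>
#include <regex>
#include <regex.h>
#include <string>
using namespace std;

int main() {
    string pattern = "\\s";
    string testString = " ";
    // C POSIX API: compile as an extended regex, then search
    regex_t cre;
    int status = regcomp(&cre, pattern.c_str(), REG_EXTENDED);
    int result = (status == 0 && regexec(&cre, testString.c_str(), 0, 0, 0) == 0);
    cout << "C: " << result << endl;
    if (status == 0) regfree(&cre);
    // C++ <regex>: constructing the regex is what throws std::regex_error
    regex re(pattern, regex_constants::extended);
    smatch sm;
    cout << "C++: " << regex_search(testString, sm, re) << endl;
}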

The C portion successfully matches the whitespace, but the C++ one throws this error:

terminate called after throwing an instance of 'std::regex_error'
what(): Unexpected escape character.

I understand that the backslash in the string literal is itself escaped, so the regex the engine actually sees is \s. I only see this issue with the POSIX extended grammar. In the C++ version, if I do not specify POSIX extended grammar when constructing the regex, it defaults to the ECMAScript grammar and parses the pattern correctly.

What is going on here?

asked Jul 22 '21 00:07 by E G


1 Answer

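regex_constants::extended selects the POSIX ERE syntax, which does not support shorthand character classes such as \s. The C regex.h implementation accepts \s only as a non-standard extension (glibc does, for example), which is why the C call matches while the C++ constructor rejects the pattern.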

To match any whitespace character in the POSIX ERE flavor enabled by regex_constants::extended, use a POSIX character class inside a bracket expression: string pattern = "[[:space:]]".
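As a minimal sketch of the POSIX ERE approach, the helper below searches for whitespace with [[:space:]] under regex_constants::extended. The function name contains_space_ere is hypothetical and chosen only for illustration; it is not part of the original code.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (name is an assumption, not from the question):
// returns true if `text` contains at least one whitespace character.
// The pattern uses the POSIX character class [[:space:]], which is valid
// in the POSIX ERE grammar, instead of the shorthand \s.
bool contains_space_ere(const std::string& text) {
    static const std::regex re("[[:space:]]", std::regex_constants::extended);
    return std::regex_search(text, re);
}
```

Calling contains_space_ere(" ") should succeed where the original "\\s" pattern failed to compile, because the bracket expression is part of the POSIX ERE grammar itself.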

However, you should just rely on the default ECMAScript flavor, which does support \s, and use

regex re(pattern);
// or
regex re(pattern, std::regex::ECMAScript);
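
To check which grammar accepts a given pattern without letting the exception terminate the program, one option is to catch std::regex_error around the constructor. This is only a sketch: the helper names compiles and contains_space_ecma are hypothetical. For the question's pattern, compiles("\\s", std::regex_constants::extended) is expected to return false on the asker's setup, although the exact behavior may vary between standard library implementations.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reports whether `pattern` can be compiled under the
// given grammar, catching std::regex_error instead of letting it propagate.
bool compiles(const std::string& pattern,
              std::regex_constants::syntax_option_type grammar) {
    try {
        std::regex re(pattern, grammar);
        (void)re;  // only the construction matters here
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}

// Hypothetical helper: whitespace search with the default (ECMAScript)
// grammar, where the shorthand class \s is supported.
bool contains_space_ecma(const std::string& text) {
    static const std::regex re("\\s");
    return std::regex_search(text, re);
}
```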

answered Oct 20 '22 12:10 by Wiktor Stribiżew