std::regex constructor throws an exception

Tags: c++, regex, c++11

Note that this is not a duplicate of the many questions on StackOverflow concerning gcc; I'm using Visual Studio 2013.

This simple construction of a regular expression throws std::regex_error:

bool caseInsensitive = true;
char pattern[] = "\\bword\\b";
std::regex re(pattern, std::regex_constants::ECMAScript | (caseInsensitive ? std::regex_constants::icase : 0));

The actual error message returned by what() on the exception object is not consistent. Usually it reports a mismatched parenthesis or brace. Why?

Mark Ransom asked Jan 07 '16 20:01


2 Answers

The problem arises because of the multiple constructors available for std::regex. Tracing into the constructor showed it using one I didn't intend!

I wanted to use this one:

explicit basic_regex(_In_z_ const _Elem *_Ptr,
    flag_type _Flags = regex_constants::ECMAScript)

But I got this one instead:

basic_regex(_In_reads_(_Count) const _Elem *_Ptr, size_t _Count,
    flag_type _Flags = regex_constants::ECMAScript)

The ternary expression in the flags causes the type to change to int, which no longer matches flag_type in the constructor signature. Since it does match on size_t it calls that constructor instead. The flags are misinterpreted as the size of the string, resulting in undefined behavior when the memory past the end of the string is accessed.
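The overload selection described above can be reproduced without the regex library at all. The following is a minimal sketch, not the library's code: flag_type, FakeRegex, pick_with_int and pick_with_flag are hypothetical stand-ins that only mirror the shape of the two constructor signatures, and the flag values are arbitrary.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the library's flag type (values are arbitrary).
enum flag_type { ECMAScript = 0x01, icase = 0x100 };

// Hypothetical stand-in for std::basic_regex; it only records which
// constructor overload resolution selected.
struct FakeRegex {
    std::string chosen;

    // Mirrors basic_regex(const char*, flag_type = ECMAScript).
    explicit FakeRegex(const char*, flag_type = ECMAScript)
        : chosen("pointer, flags") {}

    // Mirrors basic_regex(const char*, size_t, flag_type = ECMAScript).
    FakeRegex(const char*, std::size_t, flag_type = ECMAScript)
        : chosen("pointer, count, flags") {}
};

// The ternary mixes an enumerator with the literal 0, so the whole flag
// expression is an int. There is no implicit int -> enum conversion, which
// leaves only the overload taking a count viable.
inline std::string pick_with_int(bool caseInsensitive) {
    FakeRegex re("\\bword\\b", ECMAScript | (caseInsensitive ? icase : 0));
    return re.chosen;
}

// Both branches of the ternary have type flag_type, so the intended
// two-argument overload is an exact match.
inline std::string pick_with_flag(bool caseInsensitive) {
    FakeRegex re("\\bword\\b",
                 caseInsensitive ? flag_type(ECMAScript | icase) : ECMAScript);
    return re.chosen;
}
```

Calling pick_with_int with either value of caseInsensitive selects the "pointer, count, flags" constructor, because the type of the argument, not its value, drives overload resolution.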

The problem is not specific to Visual Studio. I was able to reproduce it with gcc: http://ideone.com/5DjYiz

It can be fixed in two ways. The first is an explicit cast of the argument:

std::regex re(pattern, static_cast<std::regex::flag_type>(std::regex_constants::ECMAScript | (caseInsensitive ? std::regex_constants::icase : 0)));

The second is to avoid integer constants in the ternary expression:

std::regex re(pattern, caseInsensitive ? std::regex_constants::ECMAScript | std::regex_constants::icase : std::regex_constants::ECMAScript);
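
One way to check which expressions keep the flag type is to ask the compiler directly. This is a sketch that assumes syntax_option_type is an enumeration, as it is in the MSVC, libstdc++ and libc++ implementations I know of; the standard only requires a bitmask type, so the first result could differ elsewhere. The names rc, mixed_is_flag, cast_is_flag and typed_is_flag are mine.

```cpp
#include <regex>
#include <type_traits>

namespace rc = std::regex_constants;

// Original expression: the ternary mixes a flag with the literal 0, which
// yields an integer type rather than the flag type.
constexpr bool mixed_is_flag = std::is_same<
    decltype(rc::ECMAScript | (true ? rc::icase : 0)),
    std::regex::flag_type>::value;

// First fix: the explicit cast restores the flag type.
constexpr bool cast_is_flag = std::is_same<
    decltype(static_cast<std::regex::flag_type>(
        rc::ECMAScript | (true ? rc::icase : 0))),
    std::regex::flag_type>::value;

// Second fix: both branches of the ternary already have the flag type.
constexpr bool typed_is_flag = std::is_same<
    decltype(true ? rc::ECMAScript | rc::icase : rc::ECMAScript),
    std::regex::flag_type>::value;

// Guard the two fixes at compile time.
static_assert(cast_is_flag, "cast should yield flag_type");
static_assert(typed_is_flag, "typed ternary should yield flag_type");
```

A static_assert like these next to the regex construction would turn the silent overload switch into a compile-time error.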

Mark Ransom answered Oct 19 '22 23:10


I don't find either of the proposed solutions particularly compelling or aesthetically pleasing. I think I'd prefer something like this:

auto options = std::regex_constants::ECMAScript;
if (caseInsensitive) 
    options |= std::regex_constants::icase;

std::regex re(pattern, options);
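
If the same choice is made in several places, the step-by-step version can be wrapped in a small function. This is only a sketch: make_regex is a hypothetical helper name, and the options variable is declared with the explicit type std::regex::flag_type rather than auto so the intended constructor is visible at a glance.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: builds the flags in a typed variable so that the
// (pattern, flags) constructor is always the one selected.
inline std::regex make_regex(const std::string& pattern, bool caseInsensitive) {
    std::regex::flag_type options = std::regex_constants::ECMAScript;
    if (caseInsensitive)
        options |= std::regex_constants::icase;  // add case-insensitive matching
    return std::regex(pattern, options);
}
```

For example, std::regex_search("a WORD here", make_regex("\\bword\\b", true)) finds a match, while the same call with false does not.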

If, for some misguided reason, you really insist on a single line of code, I'd use a value-constructed object of the correct type in the ternary expression:

std::regex re(pattern, std::regex_constants::ECMAScript | (caseInsensitive ? std::regex_constants::icase : std::regex_constants::syntax_option_type{}));

Or, since ECMAScript is assumed when no grammar flag is set, you can use:

std::regex re(pattern, (caseInsensitive ? std::regex_constants::icase : std::regex_constants::ECMAScript));

At least to my eye, the first of these is clearly preferable though.

Jerry Coffin answered Oct 20 '22 00:10