
On libc++, why does regex_match("tournament", regex("tour|to|tournament")) fail?

In http://llvm.org/svn/llvm-project/libcxx/trunk/test/re/re.alg/re.alg.match/ecma.pass.cpp, the following test exists:

    std::cmatch m;
    const char s[] = "tournament";
    assert(!std::regex_match(s, m, std::regex("tour|to|tournament")));
    assert(m.size() == 0);

Why should this match fail?

With VC++ 2012 and Boost, the match succeeds.
In JavaScript on Chrome and Firefox, "tournament".match(/^(?:tour|to|tournament)$/) also succeeds.

Only with libc++ does the match fail.

asked Jul 12 '13 07:07 by ganaware


1 Answer

I believe the test is correct. It is instructive to search for "tournament" in all of the libc++ tests under re.alg, and to compare how the different grammars treat regex("tour|to|tournament") and how regex_search differs from regex_match.

Let's start with regex_search:

awk, egrep, extended:

    regex_search("tournament", m, regex("tour|to|tournament"))

matches the entire input string: "tournament".

ECMAScript:

    regex_search("tournament", m, regex("tour|to|tournament"))

matches only part of the input string: "tour".

grep, basic:

    regex_search("tournament", m, regex("tour|to|tournament"))

doesn't match at all, because the '|' character is not special in these grammars.
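The three cases above can be reproduced with a short sketch. The helper names `first_match` and `compare_grammars` are made up for illustration; only `std::regex`, `std::regex_search` and the `std::regex_constants` grammar flags come from the standard library. The asserted results are the ones described above, so an implementation that deviates from them would trip the asserts.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the text matched by regex_search, or an
// empty string when nothing matches (good enough for this pattern, which
// cannot match the empty string).
std::string first_match(const char* text, const char* pattern,
                        std::regex_constants::syntax_option_type grammar)
{
    std::cmatch m;
    std::regex re(pattern, grammar);
    return std::regex_search(text, m, re) ? m.str() : std::string();
}

// Hypothetical check: runs the pattern from the question under three grammars.
void compare_grammars()
{
    using namespace std::regex_constants;
    const char* s = "tournament";
    const char* p = "tour|to|tournament";

    assert(first_match(s, p, extended) == "tournament"); // longest alternative
    assert(first_match(s, p, ECMAScript) == "tour");     // first alternative
    assert(first_match(s, p, basic).empty());            // '|' is a literal
}
```

Calling `compare_grammars()` from `main` should pass silently if the library behaves as described here.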

awk, egrep and extended will match as much as they can with alternation. However, ECMAScript alternation is "ordered", as specified in ECMA-262: once ECMAScript matches a branch in the alternation, it quits searching. The standard includes this example:

    /a|ab/.exec("abc")

returns the result "a" and not "ab".
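The same ordered behavior can be observed from C++, since ECMAScript is the default grammar of `std::regex`. This is a minimal sketch; `ecma_first_match` and `check_ordered_alternation` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: what does the default (ECMAScript) grammar match
// first in `text`? Returns an empty string when there is no match.
std::string ecma_first_match(const std::string& text, const std::string& pattern)
{
    std::smatch m;
    std::regex re(pattern); // ECMAScript is the default grammar
    return std::regex_search(text, m, re) ? m.str() : std::string();
}

void check_ordered_alternation()
{
    // Counterpart of /a|ab/.exec("abc"): the first branch wins.
    assert(ecma_first_match("abc", "a|ab") == "a");
    // Reordering the branches changes the result.
    assert(ecma_first_match("abc", "ab|a") == "ab");
}
```

The second assert shows that the order of the branches, not their length, decides the result under this grammar.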

<plug>

This is also discussed in depth in Mastering Regular Expressions by Jeffrey E.F. Friedl. I couldn't have implemented <regex> without this book. And I will freely admit that there is still much more that I don't know about regular expressions than what I know.

At the end of the chapter on alternation the author states:

If you understood everything in this chapter the first time you read it, you probably didn't read it in the first place.

Believe it!

</plug>

Anyway, ECMAScript matches only "tour". The regex_match algorithm returns success only if the entire input string is matched. Since only the first 4 characters of the input string are matched, regex_match with ECMAScript returns false with a zero-sized cmatch, unlike with awk, egrep and extended.
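The argument in this paragraph can be sketched with regex_search alone: take the match found at the start of the input and check whether it covers the whole string. The helpers `search_covers_input` and `check_coverage` are hypothetical and only model the reasoning given here. They are not how any library implements regex_match, and as the question notes, VC++ 2012 and Boost report success for the regex_match call itself.

```cpp
#include <cassert>
#include <cstring>
#include <regex>

// Hypothetical helper modelling the argument above: does the match that
// regex_search finds at the start of `text` span the entire input?
bool search_covers_input(const char* text, const char* pattern,
                         std::regex_constants::syntax_option_type grammar)
{
    std::cmatch m;
    std::regex re(pattern, grammar);
    // match_continuous only accepts matches that begin at text[0].
    if (!std::regex_search(text, m, re, std::regex_constants::match_continuous))
        return false;
    return static_cast<std::size_t>(m.length()) == std::strlen(text);
}

void check_coverage()
{
    using namespace std::regex_constants;
    const char* s = "tournament";
    const char* p = "tour|to|tournament";

    // ECMAScript stops at "tour": 4 of the 10 characters.
    assert(!search_covers_input(s, p, ECMAScript));
    // extended picks the longest alternative, which is the whole string.
    assert(search_covers_input(s, p, extended));
}
```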

answered Nov 18 '22 17:11 by Howard Hinnant