Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bug in std::regex?

Tags:

c++

regex

c++11

Here is code :

#include <string>
#include <regex>
#include <iostream>

int main()
{
    std::string pattern("[^c]ei");
    pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";
    std::regex r(pattern); 
    std::smatch results;   
    std::string test_str = "cei";

    if (std::regex_search(test_str, results, r)) 
        std::cout << results.str() << std::endl;      

    return 0;
}

Output :

cei

The compiler used is gcc 4.9.1.

I'm a newbie learning regular expression.I expected nothing should be output,since "cei" doesn't match the pattern here. Am I doing it right? What's the problem?

Update:

This one has been reported and confirmed as a bug, for detail please visit here : https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63497

like image 864
Yue Wang Avatar asked Oct 09 '14 07:10

Yue Wang


2 Answers

It's a bug in the implementation. Not only do a couple other tools I tried agree that your pattern does not match your input, but I tried this:

#include <string>
#include <regex>
#include <iostream>

int main()
{
  std::string pattern("([a-z]*)([a-z])(e)(i)([a-z]*)");
  std::regex r(pattern);
  std::smatch results;
  std::string test_str = "cei";

  if (std::regex_search(test_str, results, r))
  {
    std::cout << results.str() << std::endl;

    for (size_t i = 0; i < results.size(); ++i) {
      std::ssub_match sub_match = results[i];
      std::string sub_match_str = sub_match.str();
      std::cout << i << ": " << sub_match_str << '\n';
    }
  }
}

This is basically similar to what you had, but I replaced [:alpha:] with [a-z] for simplicity, and I also temporarily replaced [^c] with [a-z] because that seems to make it work correctly. Here's what it prints (GCC 4.9.0 on Linux x86-64):

cei
0: cei
1:
2: c
3: e
4: i
5:

If I replace [a-z] where you had [^c] and just put f there instead, it correctly says the pattern doesn't match. But if I use [^c] like you did:

std::string pattern("([a-z]*)([^c])(e)(i)([a-z]*)");

Then I get this output:

cei
0: cei
1: cei
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create
Aborted (core dumped)

So it claims to match successfully, and results[0] is "cei" which is expected. Then, results[1] is "cei" also, which I guess might be OK. But then results[2] crashes, because it tries to construct a std::string of length 18446744073709551614 with begin=nullptr. And that giant number is exactly 2^64 - 2, aka std::string::npos - 1 (on my system).

So I think there is an off-by-one error somewhere, and the impact can be much more than just a spurious regex match--it can crash at runtime.

like image 55
John Zwinck Avatar answered Nov 12 '22 05:11

John Zwinck


The regex is correct and should not match the string "cei".

The regex can be tested and explained best in Perl:

 my $regex = qr{                 # start regular expression
                 [[:alpha:]]*    # 0 or any number of alpha chars
                 [^c]            # followed by NOT-c character
                 ei              # followed by e and i characters
                 [[:alpha:]]*    # followed by 0 or any number of alpha chars    
               }x;               # end + declare 'x' mode (ignore whitespace)

 print "xei" =~ /$regex/ ? "match\n" : "no match\n";
 print "cei" =~ /$regex/ ? "match\n" : "no match\n";

The regex will first consume all chars to the end of the string ([[:alpha:]]*), then backtrack to find the NON-c char [^c] and proceed with the e and i matches (by backtracking another time).

Result:

 "xei"  -->  match
 "cei"  -->  no match

for obvious reasons. Any discrepancies to this in various C++ libraries and testing tools are the problem of the implementation there, imho.

like image 2
rubber boots Avatar answered Nov 12 '22 05:11

rubber boots