 

The difference between std::regex and boost::regex

Tags: c++, regex

When the string being searched contains \r, std::regex and boost::regex produce different matches. Why?

code:

#include <iostream>
#include <string>
#include <regex>
#include <boost/regex.hpp>

int main()
{ 
    std::string content = "123456728\r,234";
    std::string regex_string = "2.*?4";

    boost::regex reg(regex_string);

    boost::sregex_iterator it(content.begin(),content.end(),reg);
    boost::sregex_iterator end;
    std::cout <<"content size:" << content.size() << std::endl;

    // Boost finds two matches: 234 and 28\r,234 (its '.' matches \r)
    while (it != end) 
    {
        std::cout <<"boost match: " << it->str(0) <<" size: " <<it->str(0).size() << std::endl;
        ++it;
    }

    std::regex regex_std(regex_string);
    std::sregex_iterator it_std(content.begin(),content.end(),regex_std);
    std::sregex_iterator std_end;

    // std finds two matches: 234 and 234 (its '.' does not match \r)
    while (it_std != std_end) 
    {
        std::cout <<"std match: " << it_std->str(0) <<" size: " << it_std->str(0).size() << std::endl;
        ++it_std;
    }

    return 0;
}

I think Boost behaves as expected here, but I don't understand why the standard library behaves differently.

yi fu asked Oct 20 '25 02:10


1 Answer

That is expected.

The default std::regex flavor is ECMAScript (ECMA-262), and in ECMAScript, . matches any character except a LineTerminator character:

The production Atom :: . evaluates as follows:

  1. Let A be the set of all characters except LineTerminator.
  2. Call CharacterSetMatcher(A, false) and return its Matcher result.

And then section 7.3, Line Terminators, says:

Line terminators are included in the set of white space characters that are matched by the \s class in regular expressions.

Code Unit Value   Name                  Formal Name
\u000A            Line Feed             <LF>
\u000D            Carriage Return       <CR>
\u2028            Line separator        <LS>
\u2029            Paragraph separator   <PS>

In Boost regex, however, . matches any single character except:

The NULL character when the flag match_not_dot_null is passed to the matching algorithms.
The newline character when the flag match_not_dot_newline is passed to the matching algorithms.

So, by default, . in Boost regex matches \r, whereas in std::regex it does not.

Wiktor Stribiżew answered Oct 21 '25 15:10


