C++ regex bug! Square bracket expression does not work with icase flag

// regex_replace example
#include <iostream>
#include <string>
#include <regex>
#include <iterator>

int main ()
{
  std::string INPUT = "Replace_All_Characters_With_Anything";
  std::string OUTEXP = "0";
  std::regex expression("[A-Za-z]", std::regex_constants::icase);
  std::cout << std::regex_replace(INPUT, expression, OUTEXP);

  return 0;
}

This works here: http://cpp.sh/6gb5a
This also works here: https://regexr.com/5bt9d

The problem seems to come down to whether the icase flag is used. The A in All, the C in Characters, the W in With, etc. do not get replaced because of the underscore before them. The bug seems to be that a [] bracket expression only matches a character if that character does not come directly after a non-match.

There does seem to be a quick fix for this: if the bracket expression is followed by {1}, it works.

example: [A-Za-z]{1}
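
A minimal sketch of this workaround, assuming the same pattern and flag as in the code above. The helper name replaceLetters is hypothetical and only there for illustration; the only change to the pattern is the explicit {1} quantifier.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (the name is an assumption, not from the question).
// Uses the question's pattern and icase flag, with "{1}" appended to the
// bracket expression as the workaround described above.
std::string replaceLetters(const std::string& input, const std::string& replacement)
{
    static const std::regex letters("[A-Za-z]{1}", std::regex_constants::icase);
    return std::regex_replace(input, letters, replacement);
}
```

Calling replaceLetters("Replace_All_Characters_With_Anything", "0") should replace every letter and leave the underscores alone. For this particular pattern the class already lists both cases, so dropping the icase flag may be another option, going by the observation that the problem depends on that flag.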

Compiler: Microsoft Visual Studio Community 2019 / Version 16.7.3 / C++17

Also tested with C++14: same bad behavior.

expected result: 0000000_000_0000000000_0000_00000000

my result: the first letter after each underscore (the A in All, the C in Characters, the W in With, etc.) is left unreplaced
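
One way to check whether a given standard library shows this behavior is to compare std::regex_replace against a reference computed without <regex>. This is only a sketch: the function names replaceLettersByHand and icaseBracketWorks are hypothetical, and the reference assumes the default "C" locale, where std::isalpha matches exactly A-Z and a-z.

```cpp
#include <cctype>
#include <regex>
#include <string>

// Reference result computed without <regex>: every ASCII letter becomes `fill`.
// Assumes the default "C" locale, so std::isalpha accepts only A-Z and a-z.
std::string replaceLettersByHand(const std::string& input, char fill)
{
    std::string out = input;
    for (char& ch : out)
        if (std::isalpha(static_cast<unsigned char>(ch)))
            ch = fill;
    return out;
}

// Returns true when std::regex_replace, using the question's pattern and
// icase flag, agrees with the hand-written reference for this input.
bool icaseBracketWorks(const std::string& input)
{
    const std::regex expression("[A-Za-z]", std::regex_constants::icase);
    return std::regex_replace(input, expression, "0") == replaceLettersByHand(input, '0');
}
```

On an affected library, icaseBracketWorks("Replace_All_Characters_With_Anything") would be expected to return false, and true elsewhere.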

Elan Hickler asked Sep 11 '20 22:09


1 Answer

Not sure if this is an appropriate use of an answer, but this is a known bug, and it looks like it has been known for a few months. No ETA on a fix as far as I can see.

https://github.com/microsoft/STL/issues/993

Looks like RE2 is a recommended alternative regex library.

https://github.com/google/re2/

Instead of switching to another library, I will use a function that intercepts and rewrites the regex pattern string as a temporary fix. It should work whether or not the icase flag is used.

test code: https://rextester.com/LSNW3495

#include <algorithm>
#include <regex>
#include <string>
#include <vector>

// add '{1}' after each square bracket expression unless a quantifier ('?', '*', '+' or '{...}') already follows it
std::string temporaryBugFix(std::string exp)
{
    enum State
    {
        start,
        skipNext,
        lookForEndBracket,
        foundEndBracket,
    };

    State state = start;
    State prevState = start;

    int p = -1;
    std::vector<int> positionsToFix;

    for (auto c : exp)
    {
        ++p;

        switch (state)
        {
        case start:
            if (c == '\\')
            {
                prevState = state;
                state = skipNext;
            }
            else if (c == '[')
                state = lookForEndBracket;

            continue;

        case skipNext:
            state = prevState;
            continue;

        case lookForEndBracket:
            if (c == '\\')
            {
                prevState = state;
                state = skipNext;
            }
            else if (c == ']')
            {
                state = foundEndBracket;
                if (static_cast<std::size_t>(p) + 1 == exp.length()) // avoid a signed/unsigned comparison
                    positionsToFix.push_back(p + 1);
            }
            continue;

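        case foundEndBracket:
            if (c != '+' && c != '*' && c != '?')
                positionsToFix.push_back(p);
            // re-examine c as a start-state character so that adjacent
            // classes such as "[a][b]" and escapes are not skipped
            if (c == '\\')
            {
                prevState = start;
                state = skipNext;
            }
            else if (c == '[')
                state = lookForEndBracket;
            else
                state = start;
            continue;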
        }
    }

    // check for valid curly brackets so we don't add an additional one
    std::string s = exp;
    std::smatch m;
    std::regex e("\\{\\d+,?\\d*?\\}");

    int offset = 0;
    std::vector<int> validCurlyBracketPositions;
    while (std::regex_search(s, m, e))
    {
        validCurlyBracketPositions.push_back(m.position(0) + offset);
        offset += m.position(0) + m[0].length();
        s = m.suffix();
    }

    // remove valid curly bracket positions from the fix vector
    for (auto pos : validCurlyBracketPositions)
        positionsToFix.erase(std::remove(positionsToFix.begin(), positionsToFix.end(), pos), positionsToFix.end());

    // insert the fixes
    for (std::size_t i = positionsToFix.size(); i--; ) // back to front so earlier positions stay valid
        exp.insert(positionsToFix[i], "{1}");

    return exp;
}
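
The same idea can also be sketched in a single pass. This is a simplified variant under stated assumptions, not a replacement for the function above: the name appendExplicitQuantifier is hypothetical, any '{' directly after a bracket expression is assumed to start a valid quantifier (the function above verifies this with a regex), and nested POSIX classes such as [[:alpha:]] or a literal ']' as the first class member are not handled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends "{1}" after every bracket expression that is
// not directly followed by '*', '+', '?' or '{'.
// Not handled: nested classes like [[:alpha:]] and ']' as first class member.
std::string appendExplicitQuantifier(const std::string& pattern)
{
    std::string out;
    out.reserve(pattern.size() + 8);
    bool inClass = false; // true while between '[' and its closing ']'

    for (std::size_t i = 0; i < pattern.size(); ++i)
    {
        const char c = pattern[i];
        out += c;

        if (c == '\\')
        {
            // copy the escaped character verbatim and skip over it
            if (i + 1 < pattern.size())
                out += pattern[++i];
            continue;
        }

        if (!inClass)
        {
            if (c == '[')
                inClass = true;
            continue;
        }

        if (c == ']')
        {
            inClass = false;
            const char next = (i + 1 < pattern.size()) ? pattern[i + 1] : '\0';
            if (next != '*' && next != '+' && next != '?' && next != '{')
                out += "{1}";
        }
    }
    return out;
}
```

For example, "[A-Za-z]" becomes "[A-Za-z]{1}", while "[a-z]{2,3}" and the escaped "\[a\]" are left unchanged.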
Elan Hickler answered Oct 15 '22 06:10