Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is Regex (c++) taking exponential time?

I'm doing some regex problems from a textbook and of them reads the following:

"[Match] all strings that start at the beginning of the line with an integer and that end at the end of the line with a word."

I wrote the following regular expression for this:

^[0-9]+\s.*+\b[a-zA-Z]+$

However when I implemented this in C++ with the following code:

#include <iostream>
#include <string>
#include <regex>
#include <time.h>

int main(){
    clock_t t;
    bool match;
    std::string exp = "^[0-9]+\\s.*+\b[a-zA-Z]+$";
    std::string str = "1 a few words 1";
    std::string s (str);
    std::smatch m;
    std::regex e (exp);
    while (true){
        t = clock();
        match = std::regex_match(s, m, e); 
        s = s + "1";
        std::cout << clock() - t << std::endl;
    }   
}

The cpu time taken per iteration was:

1 1181529
2 3398674
3 10102763
4 30370932
5 92491242

which looks like it complexity is O( 3^n )

Why would this be? Is there something I'm doing wrong in the expression?

The growth factor is the same if I use a string like "1 a 1" though with a smaller constant.

Edit: I see the issue is that I have a .*+ oops! Still I'm not sure why this would lead to exponential behavior.

like image 798
Cory Nezin Avatar asked Jul 17 '18 20:07

Cory Nezin


1 Answers

The problem is from having .*+\b instead of the .*\\b that I'm pretty sure you intended.

As to why that would cause horrible behavior: the problem is that .* can math some arbitrary number of characters, and + means to match an arbitrary number of those. But, to fit POSIX specs, it has to try to make the overall pattern match as long a string as possible. My guess is that to do that, it's starting by trying to use the .* to match one character, and repeating it N times. Then it tries with the .* matching two characters, and repeating that M times. Then it's trying with the .* matching three characters, and repeating them L times (and so on). Oh, and note that it doesn't have to have all the .* patterns matching the same number of characters either, so the number of combinations grows exponentially.

Since it doesn't know how many characters it should match overall, it tries every possible combination until it reaches the last one, finds that they all matched the same length of string, and declares it an overall failure (since you had a \b which is a back-space character which wasn't present in your input string). Depending on whether you use an NFA or a DFA for your regex matching, you could get either the horrible behavior you observed, or you could get completely linear behavior--or (depending on how you did your DFA/NFA conversion) it might just fail to compile the regex (which is probably not quite conforming, but still probably preferable behavior).

like image 105
Jerry Coffin Avatar answered Oct 07 '22 00:10

Jerry Coffin