Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does regex_match throw "complexity exception"?

I am trying to test (using boost::regex) whether a line in a file contains only numeric entries seperated by spaces. I encountered an exception which I do not understand (see below). It would be great if someone could explain why it is thrown. Maybe I am doing something stupid here in my way of defining the patterns? Here is the code:

// regex_test.cpp
#include <string>
#include <iostream>
#include <boost/regex.hpp>
using namespace std;
using namespace boost;

int main(){
  // My basic pattern to test for a single numeric expression
  const string numeric_value_pattern = "(?:-|\\+)?[[:d:]]+\\.?[[:d:]]*";
  // pattern for the full line
  const string numeric_sequence_pattern = "([[:s:]]*"+numeric_value_pattern+"[[:s:]]*)+";

  regex r(numeric_sequence_pattern);
  string line= "1 2 3 4.444444444444";
  bool match = regex_match(line, r);
  cout<<match<<endl;

  //...
}

I compile that successfully with

g++ -std=c++11 -L/usr/lib64/ -lboost_regex regex_test.cpp  

The resulting program worked fine so far and match == true as I wanted. But then I test an input line like

string line= "1 2 3 4.44444444e-16"; 

Of course, my pattern isn't built to recognise the format 4.44444444e-16 and I would expect that match == false. However, instead I get the following runtime error:

terminate called after throwing an instance of  
'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error> >'  
  what():  The complexity of matching the regular expression exceeded predefined bounds.  
Try refactoring the regular expression to make each choice made by the state machine unambiguous.  
This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate.  

Why is that?
Note: the example I gave is extremal in the sense that putting one digit less after the dot works ok. That means

string line= "1 2 3 4.4444444e-16";

just results in match == false as expected. So, I'm baffled. What is happening here?

Thanks already!


Update:
Problem seems to be solved. Given the hint of alejrb I refactored the pattern to

const string numeric_value_pattern = "(?:-|\\+)?[[:d:]]+(?:\\.[[:d:]]*)?";  

That seems to work as it should. Somehow, the isolated optional \\. inside the original pattern [[:d:]]+\\.?[[:d:]]* left to many possibilities to match a long sequence of digits in different ways.
I hope the pattern is safe now. However, if someone finds a way to use it for a blow up in the new form, let me know! It's not so obvious for me whether that might still be possible...

like image 811
Leolo Avatar asked Oct 21 '22 14:10

Leolo


1 Answers

I'd say that your regex is probably exponentially backtracking. To protect you from a loop that would become entirely unworkable if the input were any longer, the regex engine just aborts the attempt.

One of the patterns that often causes this problem is anything of the form (x+x+)+ - which you build up here when you place the first pattern inside the second.

There's a good discussion at http://www.regular-expressions.info/catastrophic.html

like image 178
alejrb Avatar answered Oct 23 '22 04:10

alejrb