Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Detect if a regexp is exponential

This article show that there is some regexp that is O(2^n) when backtracking. The example is (x+x+)+y. When attempt to match a string like xxxx...p it going to backtrack for a while before figure it out that it couldn't match.

Is there a way to detect such regexp?

thanks

like image 361
mathk Avatar asked Jul 31 '10 09:07

mathk


1 Answers

If your regexp engine exposes runtime exponential behavior for (x+x+)+y ,then it is broken because a DFA or NFA can recognize this pattern in linear time:

echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | egrep "(x+x+)+y"
echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxy" | egrep "(x+x+)+y"

both answer immediately.

In fact, there are only a few cases (like backreferences) where backtracking is really needed (mainly, because a regexp with a backreference is not a regular expression in the language theoretic sense anymore). A capable implementation should switch to backtracking only when these corner cases are given.

In fairness, DFA's have a dark side too, because some regexp's have exponential size requirements, but a size contraints is easier to enforce than a time constraint and the huge DFA runs linear on the input, so it's a better bargain than a small backtracker choking on a couple of X's.

You should really read Russ Cox excellent article series about the implementation of regexp (and the pathological behavior of backtracking): http://swtch.com/~rsc/regexp/

To answer your question about decidability: You can't. Because there is not the one backtracking for regexpr. Every implementation has its own strategies to deal with exponential growth in their algorithm for certain cases and does not cover others. One rule might be fit for here and catastrophic for there.

UPDATE:

For example, one implementation could contain an optimizer which could use algebraic transformations to simplify regexps before executing them: (x+x+)+y is the same a xxx*y, which shouldn't be a problem for any backtracker. But the same optimizer wouldn't recognize the next expression and the problem is there again. Here someone described how to craft a regexpr which fools Perl's optimizer:

http://perlgeek.de/blog-en/perl-tips/in-search-of-an-exponetial-regexp.html

like image 71
Nordic Mainframe Avatar answered Nov 06 '22 03:11

Nordic Mainframe