Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why can regular expressions have an exponential running time?

Tags:

regex

People also ask

Why are regular expressions fast?

Good regular expressions are often longer than bad regular expressions because they make use of specific characters/character classes and have more structure. This causes good regular expressions to run faster as they predict their input more accurately.

Why are regular expressions slow?

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

Is regular expression faster than loop?

Regex is faster for large string than an if (perhaps in a for loops) to check if anything matches your requirement.


Russ Cox has a very detailed article about why this is and the history of regexes (part 2, part 3).

Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.

Largely, it comes down to proliferation of non-regular features in "regular" expressions such as backreferences, and the (continued) ignorance of most programmers that there are better alternatives for regexes that do not contain such features (which is many of them).

While writing the text editor sam in the early 1980s, Rob Pike wrote a new regular expression implementation, which Dave Presotto extracted into a library that appeared in the Eighth Edition. Pike's implementation incorporated submatch tracking into an efficient NFA simulation but, like the rest of the Eighth Edition source, was not widely distributed. Pike himself did not realize that his technique was anything new. Henry Spencer reimplemented the Eighth Edition library interface from scratch, but using backtracking, and released his implementation into the public domain. It became very widely used, eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl, PCRE, Python, and so on. (In his defense, Spencer knew the routines could be slow, and he didn't know that a more efficient algorithm existed. He even warned in the documentation, “Many users have found the speed perfectly adequate, although replacing the insides of egrep with this code would be a mistake.”) Pike's regular expression implementation, extended to support Unicode, was made freely available with sam in late 1992, but the particularly efficient regular expression search algorithm went unnoticed.


Regular expressions conforming to this formal definition are computable in linear time, because they have corresponding finite automatas. They are built only from parentheses, alternative | (sometimes called sum), Kleene star * and concatenation.

Extending regular expressions by adding, for example, backward references can lead even to NP-complete regular expressions. Here you can find an example of regular expression recognizing non-prime numbers.

I guess that, such an extended implementation can have non-linear matching time even in simple cases.

I made a quick experiment in Perl and your regular expression computes equally fast for odd and even number of 'a's.