Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why do most languages implement wildcard regular expressions inefficiently?

Tags:

regex

I was given a link to the following article regarding the implementation of regular expressions in many modern languages.

http://swtch.com/~rsc/regexp/regexp1.html

TL;DNR: Certain regular expressions such as (a?)^na^n for fixed $n$ take exponential time matched against, say, a^n because its implemented via backtracking over the string when matching the ? section. Implementing these as an NFA by keeping state lists makes this much more efficient for obvious reasons

The details of how each language actually implements these isn't very detailed (and the article is old), but I'm curious: what, if any, are the drawbacks of using an NFA as opposed to other implementation techniques. The only thing I can come up with is that with all the bells and whistles of most libraries either a) building a NFA for all those features is impractical or b) there is some conflicting performance issue between the expression above and some other, possibly more common, operation.

like image 214
dfb Avatar asked Jan 31 '14 18:01

dfb


People also ask

What are regular expressions used for?

Regular expressions are useful in search and replace operations. The typical use case is to look for a sub-string that matches a pattern and replace it with something else. Most APIs using regular expressions allow you to reference capture groups from the search pattern in the replacement string.

Is RegEx the same across languages?

The findings of this experiment showed that thousands of modules (20%) shared the same regexes, both within and across languages and that 5% of all corpus modules (about 10,000) primarily written in JavaScript, use regexes from Stack Overflow and RegExLib.

Why is RegEx so complicated?

Regular expressions are dense. This makes them hard to read, but not in proportion to the information they carry. Certainly 100 characters of regular expression syntax is harder to read than 100 consecutive characters of ordinary prose or 100 characters of C code.

What are wildcards in data visualization?

A wildcard — sometimes called a placeholder — is indicated in the above code by the character couple . * . This . * instructs the interpreter to search the target text for any of the characters to the left of it, which can then be followed by any character or characters to the right of it.


1 Answers

While it is possible to construct DFAs that handle these complex cases well (the Tcl RE engine, which was written by Henry Spencer, is a proof by example; the article linked indicated this with its performance data) it's also exceptionally hard.

One key thing though is that if you can detect that you never need the matching group information, you can then (for many REs, especially those without internal backreferences) transform the RE into one that only uses parentheses for grouping allowing a more efficient RE to be generated (so (a?){n}a{n} — I'm using modern conventional syntax — becomes effectively equivalent to a{n,2n}). Backreferences break that major optimisation; it's not for nothing that in Henry's RE code (alluded to above) there is a code comment describing them as the “Feature from the Black Lagoon”. It is one of the best comments I've ever read in code (with the exception of references to academic papers that describe the algorithm encoded).

On the other hand, the Perl/PCRE style engines with their recursive-descent evaluation schemes, can ascribe a much saner set of semantics to mixed greediness REs, and many other things besides. (At the extreme end of this, recursive patterns — (?R) et al — are completely impossible with automata-theoretic approaches. They require a stack to match, making them formally not be regular expressions.)

On a practical level, the cost of building the NFA and the DFA you then compile that to can be quite high. You need clever caching to make it not too expensive. And also on a practical level, the PCRE and Perl implementations have had a lot more developer effort applied to them.

like image 130
Donal Fellows Avatar answered Jan 05 '23 03:01

Donal Fellows