I was given a link to the following article regarding the implementation of regular expressions in many modern languages.
http://swtch.com/~rsc/regexp/regexp1.html
TL;DR: Certain regular expressions, such as $(a?)^n a^n$ for fixed $n$, take exponential time when matched against, say, $a^n$, because the match is implemented via backtracking over the string when handling the ? sections. Implementing these as an NFA by keeping state lists makes this much more efficient, for obvious reasons.
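To make the blow-up concrete, here is a small timing sketch (my own, not from the article) using Python's re module, which is a backtracking engine; the pattern is built literally as in the article, a? repeated n times followed by a repeated n times:

```
import re
import time

# The article's pathological pattern, written out literally:
# for n = 3 it is "a?a?a?aaa", matched against "aaa".
for n in (5, 10, 15, 20):
    pattern = "a?" * n + "a" * n
    text = "a" * n
    start = time.perf_counter()
    assert re.fullmatch(pattern, text) is not None
    print(f"n={n:2d}  {time.perf_counter() - start:.3f}s")
# Each additional a? roughly doubles the running time.
```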
The article doesn't go into much detail about how each language actually implements these (and it is old), but I'm curious: what, if any, are the drawbacks of using an NFA as opposed to other implementation techniques? The only things I can come up with are that, with all the bells and whistles of most libraries, either a) building an NFA that supports all those features is impractical, or b) there is some conflicting performance issue between the expression above and some other, possibly more common, operation.
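For comparison, the "keeping state lists" approach from the summary above can be sketched by hand for this one pattern family (this is my own toy illustration, not a general Thompson construction): the matcher tracks the set of pattern positions that are still live, so the work per input character is bounded by the pattern length instead of exploding combinatorially.

```
def match_opt_a_then_a(n, text):
    """Match the pattern a? repeated n times, then a repeated n times,
    by simulating an NFA with a set of states (positions in the pattern).

    Positions 0..n-1 are the optional a's, n..2n-1 the mandatory a's,
    and 2n is the accepting position.
    """
    total = 2 * n

    def closure(states):
        # Epsilon moves: an optional "a?" may be skipped entirely.
        out, stack = set(states), list(states)
        while stack:
            i = stack.pop()
            if i < n and i + 1 not in out:
                out.add(i + 1)
                stack.append(i + 1)
        return out

    current = closure({0})
    for ch in text:
        # Every live position that can consume this character advances by one.
        nxt = {i + 1 for i in current if i < total and ch == "a"}
        current = closure(nxt)
        if not current:
            return False           # no live states left: cannot match
    return total in current        # did any path reach the accepting position?

print(match_opt_a_then_a(25, "a" * 25))  # True, and instant even for large n
print(match_opt_a_then_a(25, "a" * 24))  # False (too few a's)
```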
Regular expressions are useful in search-and-replace operations. The typical use case is to look for a substring that matches a pattern and replace it with something else. Most APIs that use regular expressions allow you to reference capture groups from the search pattern in the replacement string.
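For example (in Python; the pattern and strings here are made up for illustration), capture groups from the search pattern are referenced in the replacement string:

```
import re

# Swap "Lastname, Firstname" into "Firstname Lastname" using two capture
# groups, referenced in the replacement as \1 and \2.
text = "Knuth, Donald and Dijkstra, Edsger"
result = re.sub(r"(\w+), (\w+)", r"\2 \1", text)
print(result)  # Donald Knuth and Edsger Dijkstra
```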
One empirical study of regex reuse found that thousands of modules (about 20% of its corpus) shared the same regexes, both within and across languages, and that about 5% of all corpus modules (roughly 10,000), primarily written in JavaScript, used regexes taken from Stack Overflow and RegExLib.
Regular expressions are dense. This makes them hard to read, but not out of proportion to the information they carry. Certainly 100 characters of regular expression syntax are harder to read than 100 consecutive characters of ordinary prose or 100 characters of C code, but those 100 characters of regular expression typically do far more work than the same length of C.
A wildcard, sometimes called a placeholder, is written in regular expression syntax as the character pair .* . The . matches any single character and the * means "zero or more repetitions of the preceding element", so .* tells the engine to match any sequence of characters (including the empty sequence) at that point in the pattern.
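A small illustration in Python (the sample strings are invented):

```
import re

# "." matches any single character and "*" means "zero or more repetitions
# of the preceding element", so "err.*!" matches "err", then any run of
# characters, then an exclamation mark.
m = re.search(r"err.*!", "log: error: disk full!")
print(m.group())  # "error: disk full!"
```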
While it is possible to construct DFAs that handle these complex cases well (the Tcl RE engine, which was written by Henry Spencer, is a proof by example; the linked article indicates as much in its performance data), it is also exceptionally hard.
One key thing, though, is that if you can detect that you never need the matching-group information, you can then (for many REs, especially those without internal backreferences) transform the RE into one that only uses parentheses for grouping, allowing a more efficient RE to be generated (so (a?){n}a{n}, to use modern conventional syntax, becomes effectively equivalent to a{n,2n}). Backreferences break that major optimisation; it's not for nothing that in Henry's RE code (alluded to above) there is a code comment describing them as the “Feature from the Black Lagoon”. It is one of the best comments I've ever read in code (with the exception of references to academic papers that describe the algorithm being encoded).
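A rough way to see the payoff (sketched in Python, whose re module is a backtracking engine) is that the capture-free rewrite stays fast on exactly the input that makes the expanded form crawl, while a backreference such as (a+)\1 depends on what the group actually matched and so admits no such rewrite:

```
import re
import time

n = 22
text = "a" * n

# Expanded form with one capture group per optional "a": exponential backtracking.
slow = re.compile("(a?)" * n + "a" * n)
# Capture-free rewrite described above: between n and 2n letters "a".
fast = re.compile("a{%d,%d}" % (n, 2 * n))

for name, rx in (("(a?)...a...", slow), ("a{n,2n}    ", fast)):
    start = time.perf_counter()
    assert rx.fullmatch(text) is not None
    print(f"{name}  {time.perf_counter() - start:.4f}s")

# A backreference depends on what the group matched, so no capture-free
# rewrite exists: (a+)\1 matches "aaaa" only as "aa" followed by "aa".
print(re.fullmatch(r"(a+)\1", "aaaa").group(1))  # prints "aa"
```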
On the other hand, the Perl/PCRE-style engines, with their recursive-descent evaluation schemes, can ascribe a much saner set of semantics to mixed-greediness REs, and to many other things besides. (At the extreme end of this, recursive patterns such as (?R) are completely impossible with automata-theoretic approaches: they require a stack to match, which makes them formally not regular expressions at all.)
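To make the recursion point concrete, here is a sketch using the third-party Python regex module, which supports PCRE-style recursion (the standard library re does not); it matches balanced parentheses, something no finite automaton can do because it requires unbounded counting:

```
# Requires the third-party "regex" module (pip install regex).
# The pattern matches a balanced pair of parentheses by recursively
# matching the whole expression, (?R), inside each nested pair.
import regex

balanced = regex.compile(r"\((?:[^()]|(?R))*\)")
print(bool(balanced.fullmatch("(a(b(c))d)")))  # True
print(bool(balanced.fullmatch("(a(b)c")))      # False (unclosed paren)
```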
On a practical level, the cost of building the NFA, and then the DFA you compile it to, can be quite high; you need clever caching to keep it affordable. Also on a practical level, the PCRE and Perl implementations have had far more developer effort applied to them.
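As a rough sketch of the caching idea (a toy NFA and names of my own invention, not any real engine's internals): DFA transitions can be built lazily from NFA state sets and memoised, so you only ever pay for the DFA states that an input actually reaches.

```
# A tiny lazy-DFA construction with a transition cache, over a toy NFA
# that recognises strings over {a, b} ending in "ab".

# NFA transitions: state -> {symbol: set of next states}
NFA = {
    0: {"a": {0, 1}, "b": {0}},   # state 0 loops on anything, may start "ab"
    1: {"b": {2}},                # saw "a", need "b"
    2: {},                        # accepting: just saw "ab"
}
START, ACCEPTING = frozenset({0}), {2}

dfa_cache = {}  # (frozenset of NFA states, symbol) -> frozenset of NFA states

def step(states, symbol):
    """One DFA step, built lazily from NFA state sets and memoised."""
    key = (states, symbol)
    if key not in dfa_cache:
        nxt = set()
        for s in states:
            nxt |= NFA[s].get(symbol, set())
        dfa_cache[key] = frozenset(nxt)
    return dfa_cache[key]

def matches(text):
    states = START
    for ch in text:
        states = step(states, ch)
    return bool(states & ACCEPTING)

print(matches("abab"))  # True  (ends in "ab")
print(matches("abba"))  # False
```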