I was given a link to the following article regarding the implementation of regular expressions in many modern languages.
http://swtch.com/~rsc/regexp/regexp1.html
TL;DR: Certain regular expressions, such as $(a?)^n a^n$ for fixed $n$, take exponential time when matched against, say, $a^n$, because the match is implemented via backtracking over the string when handling the ? sections. Implementing these as an NFA by keeping state lists makes this much more efficient, for obvious reasons.
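To make the blow-up concrete, here is a small timing sketch (my own, not from the article) using Python's re module, which is a backtracking engine; the pattern is built literally as in the article, a? repeated n times followed by a repeated n times:

```
import re
import time

# The article's pathological pattern, written out literally:
# for n = 3 it is "a?a?a?aaa", matched against "aaa".
for n in (5, 10, 15, 20):
    pattern = "a?" * n + "a" * n
    text = "a" * n
    start = time.perf_counter()
    assert re.fullmatch(pattern, text) is not None
    print(f"n={n:2d}  {time.perf_counter() - start:.3f}s")
# Each additional a? roughly doubles the running time.
```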
The article doesn't go into much detail about how each language actually implements these (and it is old), but I'm curious: what, if any, are the drawbacks of using an NFA as opposed to other implementation techniques? The only things I can come up with are that, with all the bells and whistles of most libraries, either a) building an NFA that supports all those features is impractical, or b) there is some conflicting performance issue between the expression above and some other, possibly more common, operation.
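For comparison, the "keeping state lists" approach from the summary above can be sketched by hand for this one pattern family (this is my own toy illustration, not a general Thompson construction): the matcher tracks the set of pattern positions that are still live, so the work per input character is bounded by the pattern length instead of exploding combinatorially.

```
def match_opt_a_then_a(n, text):
    """Match the pattern a? repeated n times, then a repeated n times,
    by simulating an NFA with a set of states (positions in the pattern).

    Positions 0..n-1 are the optional a's, n..2n-1 the mandatory a's,
    and 2n is the accepting position.
    """
    total = 2 * n

    def closure(states):
        # Epsilon moves: an optional "a?" may be skipped entirely.
        out, stack = set(states), list(states)
        while stack:
            i = stack.pop()
            if i < n and i + 1 not in out:
                out.add(i + 1)
                stack.append(i + 1)
        return out

    current = closure({0})
    for ch in text:
        # Every live position that can consume this character advances by one.
        nxt = {i + 1 for i in current if i < total and ch == "a"}
        current = closure(nxt)
        if not current:
            return False           # no live states left: cannot match
    return total in current        # did any path reach the accepting position?

print(match_opt_a_then_a(25, "a" * 25))  # True, and instant even for large n
print(match_opt_a_then_a(25, "a" * 24))  # False (too few a's)
```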
Regular expressions are useful in search-and-replace operations. The typical use case is to look for a substring that matches a pattern and replace it with something else. Most APIs that use regular expressions allow you to reference capture groups from the search pattern in the replacement string.
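For example (in Python; the pattern and strings here are made up for illustration), capture groups from the search pattern are referenced in the replacement string:

```
import re

# Swap "Lastname, Firstname" into "Firstname Lastname" using two capture
# groups, referenced in the replacement as \1 and \2.
text = "Knuth, Donald and Dijkstra, Edsger"
result = re.sub(r"(\w+), (\w+)", r"\2 \1", text)
print(result)  # Donald Knuth and Edsger Dijkstra
```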
One empirical study of regex reuse found that thousands of modules (about 20% of its corpus) shared the same regexes, both within and across languages, and that about 5% of all corpus modules (roughly 10,000), primarily written in JavaScript, used regexes taken from Stack Overflow and RegExLib.
Regular expressions are dense. This makes them hard to read, but not out of proportion to the information they carry. Certainly 100 characters of regular expression syntax are harder to read than 100 consecutive characters of ordinary prose or 100 characters of C code, but those 100 characters of regular expression typically do far more work than the same length of C.
A wildcard, sometimes called a placeholder, is written in regular expression syntax as the character pair .* . The . matches any single character and the * means "zero or more repetitions of the preceding element", so .* tells the engine to match any sequence of characters (including the empty sequence) at that point in the pattern.
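A small illustration in Python (the sample strings are invented):

```
import re

# "." matches any single character and "*" means "zero or more repetitions
# of the preceding element", so "err.*!" matches "err", then any run of
# characters, then an exclamation mark.
m = re.search(r"err.*!", "log: error: disk full!")
print(m.group())  # "error: disk full!"
```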
While it is possible to construct DFAs that handle these complex cases well (the Tcl RE engine, which was written by Henry Spencer, is a proof by example; the linked article indicates as much in its performance data), it is also exceptionally hard.
One key thing, though, is that if you can detect that you never need the matching-group information, you can then (for many REs, especially those without internal backreferences) transform the RE into one that only uses parentheses for grouping, allowing a more efficient RE to be generated (so (a?){n}a{n}, to use modern conventional syntax, becomes effectively equivalent to a{n,2n}). Backreferences break that major optimisation; it's not for nothing that in Henry's RE code (alluded to above) there is a code comment describing them as the “Feature from the Black Lagoon”. It is one of the best comments I've ever read in code (with the exception of references to academic papers that describe the algorithm being encoded).
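A rough way to see the payoff (sketched in Python, whose re module is a backtracking engine) is that the capture-free rewrite stays fast on exactly the input that makes the expanded form crawl, while a backreference such as (a+)\1 depends on what the group actually matched and so admits no such rewrite:

```
import re
import time

n = 22
text = "a" * n

# Expanded form with one capture group per optional "a": exponential backtracking.
slow = re.compile("(a?)" * n + "a" * n)
# Capture-free rewrite described above: between n and 2n letters "a".
fast = re.compile("a{%d,%d}" % (n, 2 * n))

for name, rx in (("(a?)...a...", slow), ("a{n,2n}    ", fast)):
    start = time.perf_counter()
    assert rx.fullmatch(text) is not None
    print(f"{name}  {time.perf_counter() - start:.4f}s")

# A backreference depends on what the group matched, so no capture-free
# rewrite exists: (a+)\1 matches "aaaa" only as "aa" followed by "aa".
print(re.fullmatch(r"(a+)\1", "aaaa").group(1))  # prints "aa"
```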
On the other hand, the Perl/PCRE-style engines, with their recursive-descent evaluation schemes, can ascribe a much saner set of semantics to mixed-greediness REs, and to many other things besides. (At the extreme end of this, recursive patterns such as (?R) are completely impossible with automata-theoretic approaches: they require a stack to match, which makes them formally not regular expressions at all.)
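To make the recursion point concrete, here is a sketch using the third-party Python regex module, which supports PCRE-style recursion (the standard library re does not); it matches balanced parentheses, something no finite automaton can do because it requires unbounded counting:

```
# Requires the third-party "regex" module (pip install regex).
# The pattern matches a balanced pair of parentheses by recursively
# matching the whole expression, (?R), inside each nested pair.
import regex

balanced = regex.compile(r"\((?:[^()]|(?R))*\)")
print(bool(balanced.fullmatch("(a(b(c))d)")))  # True
print(bool(balanced.fullmatch("(a(b)c")))      # False (unclosed paren)
```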
On a practical level, the cost of building the NFA, and then the DFA you compile it to, can be quite high; you need clever caching to keep it affordable. Also on a practical level, the PCRE and Perl implementations have had far more developer effort applied to them.
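As a rough sketch of the caching idea (a toy NFA and names of my own invention, not any real engine's internals): DFA transitions can be built lazily from NFA state sets and memoised, so you only ever pay for the DFA states that an input actually reaches.

```
# A tiny lazy-DFA construction with a transition cache, over a toy NFA
# that recognises strings over {a, b} ending in "ab".

# NFA transitions: state -> {symbol: set of next states}
NFA = {
    0: {"a": {0, 1}, "b": {0}},   # state 0 loops on anything, may start "ab"
    1: {"b": {2}},                # saw "a", need "b"
    2: {},                        # accepting: just saw "ab"
}
START, ACCEPTING = frozenset({0}), {2}

dfa_cache = {}  # (frozenset of NFA states, symbol) -> frozenset of NFA states

def step(states, symbol):
    """One DFA step, built lazily from NFA state sets and memoised."""
    key = (states, symbol)
    if key not in dfa_cache:
        nxt = set()
        for s in states:
            nxt |= NFA[s].get(symbol, set())
        dfa_cache[key] = frozenset(nxt)
    return dfa_cache[key]

def matches(text):
    states = START
    for ch in text:
        states = step(states, ch)
    return bool(states & ACCEPTING)

print(matches("abab"))  # True  (ends in "ab")
print(matches("abba"))  # False
```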