What is the basic approach used by Perl, Python, Java, Vim, etc., to implement parsing with regular expressions?
Not the clever formal-language approaches (i.e. NFA, DFA), nor parser combinators (e.g. the 14-line regex engine).
I've had a look at the source for Java's implementation of Perl-style regular expressions, but its sophisticated features (e.g. back-references) and efficiency tricks (e.g. Boyer-Moore substring matching) made it difficult to see how it basically works.
EDIT
Various sources say "backtracking" is involved (e.g. Regular Expression Matching Can Be Simple And Fast; formal-methods courses), but they are not clear on exactly what is backtracked over... is it a way to evaluate an NFA? Can it be done directly from the AST of the regex?
What do java/perl/python regex engines actually do?
Is it something like: "a way to generate all possible words in the regular language, but abandon a specific word as soon as it doesn't match the input string".
There are two general approaches in regular expression engines.
Regular expressions can be transformed into finite automata. This relationship is well studied in computer science. These finite automata can then be executed efficiently, and even run backwards. They provide strong guarantees, such as running in linear time and constant space with respect to the input string, but creating the finite automaton from the regex can be expensive. This approach also restricts the engine to true regular expressions, i.e. it precludes advanced features such as back-references or recursion.
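As a toy illustration of the automaton approach (hand-built for this answer, not taken from any real engine), here is an NFA for the pattern a*b, simulated by tracking the set of live states in a single left-to-right pass over the input – linear time, constant space:

```python
# NFA for a*b. States: 0 = start (loops on 'a'), 1 = expects 'b', 2 = accept.
# State 0 has an epsilon edge to state 1, so both start out live.
def nfa_match_a_star_b(text):
    states = {0, 1}              # epsilon closure of the start state
    for ch in text:
        nxt = set()
        if 0 in states and ch == 'a':
            nxt |= {0, 1}        # stay in the a* loop, epsilon into state 1
        if 1 in states and ch == 'b':
            nxt.add(2)
        states = nxt
        if not states:           # no live states left: fail early
            return False
    return 2 in states
```

A real engine builds this state machine mechanically from the pattern (e.g. via Thompson's construction) rather than hard-coding it, but the simulation loop looks just like this.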
Regular expressions can be interpreted by a backtracking engine. If an alternative in the pattern fails, the engine backtracks to the last decision point and tries a different choice. This is extremely flexible and (with extra features like recursion and named subpatterns) can parse a much larger class of formal languages (formally, something like the LL(*) set of grammars). This is very similar to PEG parsers. The big drawback: due to the backtracking, running the regex can take exponential time – even without any of the extra advanced features.
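To make "backtrack to the last decision point" concrete, here is a minimal Pike-style backtracking matcher – a sketch supporting only literals, `.`, `*`, `^` and `$`, not any engine's actual code. Note that it interprets the pattern text directly, without first building an NFA (so yes, backtracking can be done straight off the regex, or off its AST); production engines usually compile the pattern to bytecode first, but the control flow is the same: `match_star` retries with successively longer runs until the rest of the pattern matches or it runs out of input.

```python
def match(pattern, text):
    """Return True if pattern matches anywhere in text."""
    if pattern.startswith('^'):
        return match_here(pattern[1:], text)
    while True:                        # try each starting position in turn
        if match_here(pattern, text):
            return True
        if not text:
            return False
        text = text[1:]

def match_here(pattern, text):
    """Return True if pattern matches at the beginning of text."""
    if not pattern:
        return True
    if len(pattern) >= 2 and pattern[1] == '*':
        return match_star(pattern[0], pattern[2:], text)
    if pattern == '$':
        return not text
    if text and pattern[0] in ('.', text[0]):
        return match_here(pattern[1:], text[1:])
    return False

def match_star(c, pattern, text):
    """Match c* followed by pattern; each loop iteration is a decision point."""
    while True:
        if match_here(pattern, text):  # try matching the rest here...
            return True
        if not (text and c in ('.', text[0])):
            return False               # ...on failure, consume one more c
        text = text[1:]
```

This is essentially Rob Pike's famous tiny matcher from The Practice of Programming; the "decision points" the answer mentions are the loop iterations in `match` and `match_star`.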
On top of that, regex engines feature extra optimizations such as first searching for constant substrings in the pattern, since that is more efficient than running any kind of regex (and may even be able to use vectorized CPU instructions). If there is a choice point between multiple constant strings, those can be compiled very easily into a trie data structure (effectively, a simple finite automaton). This can reduce the amount of backtracking.
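A sketch of that trie idea (hypothetical helper names, not any engine's API): compile the literal alternatives into a trie once, then scan the input following at most one edge per character, instead of re-trying each alternative from scratch:

```python
def build_trie(words):
    """Build a nested-dict trie; '$' marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def trie_prefix_match(trie, text):
    """Return the longest word in the trie that is a prefix of text, or None."""
    node, longest = trie, None
    for i, ch in enumerate(text):
        if ch not in node:
            break                      # one failed lookup rules out ALL words
        node = node[ch]
        if '$' in node:
            longest = text[:i + 1]
    return longest
```

For the alternation `foo|foobar|fox`, a naive backtracker would try each branch separately; the trie shares the common `fo` prefix, so each input character is examined once.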
A regex that demonstrates the difference between finite automata and backtracking is the pattern a*a*a*a*a*b on the string aaaaaaaaaaaaaaacb. A finite automaton can easily see that this pattern will not match because of the c in the input. But a backtracking engine now has very many decision points where it can try different lengths for each a* subpattern. Regex engines like Perl's or the re module in Python blow up combinatorially in this case, i.e. take very long to complete – add more a's to the input to make it take much longer (with nested quantifiers, e.g. (a*)*b, the blow-up is truly exponential). This allows for interesting denial-of-service attacks if untrusted users can supply arbitrary regexes. For untrusted input, only finite-automaton-based regex engines should be used, e.g. RE2 from Google.
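You can observe the blow-up directly with Python's re module. A small experiment – the absolute timings are machine-dependent and illustrative only, but the growth with n is the point:

```python
import re
import time

pattern = re.compile(r'a*a*a*a*a*b')

# The 'c' before the final 'b' guarantees a failed match, so the engine must
# try every way of splitting the a's among the five a* subpatterns before
# giving up. A finite-automaton engine would reject this in a single pass.
for n in (10, 20, 30, 40):
    text = 'a' * n + 'cb'
    start = time.perf_counter()
    assert pattern.match(text) is None
    print(f'n={n}: {time.perf_counter() - start:.5f}s')
```

Push n higher (or nest the quantifiers) and the match time quickly becomes long enough to matter for denial of service.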
The regexes in Perl 2 were taken from Henry Spencer's library.
Its regexp.c was only about two thousand lines; it's not as difficult to follow as later versions with more features.