 

Optimization techniques for backtracking regex implementations

I'm trying to implement a regular expression matcher based on the backtracking approach sketched in Exploring Ruby's Regular Expression Algorithm. The compiled regex is translated into an array of virtual machine commands; for the backtracking, the current command and input string indices as well as the capturing group information are maintained on a stack.
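For concreteness, here is a minimal sketch of the kind of VM I mean, in Python; the opcode names and stack layout are simplified illustrations rather than my actual implementation. Each backtracking stack entry holds a complete machine state, and a failed path simply pops the next alternative:

```python
# Minimal sketch of a stack-based backtracking VM (opcode names and stack
# layout are simplified illustrations). Each stack entry is a complete
# machine state: program counter, string position and capture slots.
CHAR, SPLIT, JMP, SAVE, MATCH = range(5)

def run(prog, text):
    stack = [(0, 0, [None] * 20)]          # (pc, sp, captures)
    while stack:
        pc, sp, caps = stack.pop()
        while True:
            op = prog[pc]
            if op[0] == CHAR:              # (CHAR, 'a')
                if sp < len(text) and text[sp] == op[1]:
                    pc, sp = pc + 1, sp + 1
                else:
                    break                  # fail -> pop the next alternative
            elif op[0] == SPLIT:           # (SPLIT, x, y): try x, remember y
                stack.append((op[2], sp, caps[:]))
                pc = op[1]
            elif op[0] == JMP:             # (JMP, x)
                pc = op[1]
            elif op[0] == SAVE:            # (SAVE, n): record capture offset
                caps = caps[:]
                caps[op[1]] = sp
                pc += 1
            elif op[0] == MATCH:
                return caps
    return None

# a(b|c)d ~ CHAR a, SAVE 2, SPLIT 3 5, CHAR b, JMP 6, CHAR c, SAVE 3, CHAR d, MATCH
prog = [(CHAR, 'a'), (SAVE, 2), (SPLIT, 3, 5), (CHAR, 'b'), (JMP, 6),
        (CHAR, 'c'), (SAVE, 3), (CHAR, 'd'), (MATCH,)]
print(run(prog, "acd") is not None)        # True
```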

In Regular Expression Matching: the Virtual Machine Approach, Cox gives more detailed information about how to compile certain regex components into VM commands, though the discussed implementations are a bit different. Based on those articles, my implementation works quite well for the standard grouping, character class and repetition components.

Now I would like to see what extensions and optimization options there are for this type of implementation. In his article, Cox gives a lot of useful information on the DFA/NFA approach, but the information about extensions or optimization techniques for the backtracking approach is rather sparse.

For example, about backreferences he states

Backreferences are trivial in backtracking implementations.

and gives an idea for the DFA approach. But it's not clear to me how this can be "trivially" done with the VM approach. When the backreference command is reached, you'd have to compile the previously matched string from the corresponding group into another list of VM commands and somehow incorporate those commands into the current VM, or maintain a second VM and switch execution temporarily to that one.
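For illustration, here is what I imagine a dedicated backreference opcode could look like in the toy VM sketched above; the opcode and the capture-slot convention (group n stored in slots 2n and 2n+1) are assumptions on my part. The handler just compares the text recorded by the group's save slots against the input at the current position, which would avoid compiling anything at match time:

```python
# Hypothetical BACKREF opcode handler for the toy VM above, assuming group n
# stores its start/end offsets in save slots 2n and 2n+1. It compares the
# text the group matched earlier against the input at the current position;
# on mismatch the caller backtracks as usual, so no recompilation and no
# second VM is involved.
def step_backref(op, text, sp, caps):
    group = op[1]                            # (BACKREF, n)
    start, end = caps[2 * group], caps[2 * group + 1]
    if start is None or end is None:
        return sp                            # group never matched: treat as empty
    captured = text[start:end]
    if text.startswith(captured, sp):        # plain literal comparison
        return sp + len(captured)
    return None                              # signal failure to the caller
```

(Whether this literal comparison is all that is meant by "trivial" is part of what I'm asking.)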

He also mentions a possible optimization for repetitions using look-aheads, but doesn't elaborate on how that would work. It seems to me this could be used to reduce the number of items on the backtracking stack.
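My guess at what such a look-ahead could look like, as a rough sketch; the SPLIT1 opcode and the precomputed first-character sets are purely hypothetical:

```python
# Hypothetical SPLIT1 opcode carrying precomputed first-character sets of the
# two branches of a repetition. By peeking one character ahead, a branch that
# cannot possibly match is never pushed onto the backtracking stack, so e.g.
# a*b stops growing the stack as soon as the next character rules the loop
# body out.
def step_split1(op, text, sp, stack, caps):
    _, pc_loop, pc_exit, first_loop, first_exit = op
    nxt = text[sp] if sp < len(text) else None
    if nxt in first_loop and nxt in first_exit:
        stack.append((pc_exit, sp, caps[:]))   # genuinely ambiguous: keep both
        return pc_loop
    if nxt in first_loop:
        return pc_loop                         # only the loop body can start here
    return pc_exit                             # only the continuation can start here
```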

tl;dr What general optimization techniques exist for VM-based backtracking regex implementations and how do they work? Note that I'm not looking for optimizations specific to a certain programming language, but general techniques for this type of regex implementation.


Edit: As mentioned in the first link, the Oniguruma library implements a regex matcher with exactly that stack-based backtracking approach. Perhaps someone can explain the optimizations done by that library which can be generalized to other implementations. Unfortunately, the library doesn't seem to provide any documentation on the source code and the code also lacks comments.


Edit 2: When reading about parsing expression grammars (PEGs), I stumbled upon a paper on a Lua PEG implementation which makes use of a similar VM-based approach. The paper mentions several optimization options to reduce the number of executed VM commands and unnecessary growth of the backtracking stack.
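If I understand the paper correctly, one of those options is to handle loops with a kind of "partial commit" that reuses the existing backtrack entry instead of pushing a new one per iteration; a rough sketch of the idea (the opcode name and entry layout are illustrative, not the exact bytecode from the paper):

```python
# Rough sketch of the "partial commit" idea for a loop like p*: instead of
# popping the backtrack entry and pushing a fresh one on every iteration,
# the entry on top of the stack is updated in place, so the loop runs with
# a constant-size backtracking stack.
def op_partial_commit(stack, sp, caps, pc_target):
    pc_saved, _, _ = stack[-1]            # keep the saved failure address
    stack[-1] = (pc_saved, sp, caps[:])   # refresh the stored input position
    return pc_target                      # jump back to the loop body
```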

asked Jul 31 '19 by siracusa

People also ask

What is backtracking regex?

Backtracking occurs when a regular expression pattern contains optional quantifiers or alternation constructs, and the regular expression engine returns to a previous saved state to continue its search for a match.

Why are regular expressions so fast?

In general, the longer regex is the better regex: good regular expressions are often longer than bad regular expressions because they make use of specific characters/character classes and have more structure. This causes good regular expressions to run faster as they predict their input more accurately.

Are regular expressions fast?

Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow.

What is regular expression in algorithm?

A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.


1 Answer

I suggest you watch the full lecture, it is very interesting, but here is an outline:

  • Complexity explosion in backtracking. This happens when the pattern has ambiguity in it ([a-x]*[a-x0-9]*z in the video, as an example), so the engine has to backtrack and test all alternatives until it becomes certain that the pattern did (or didn't) match.

    It can take up to O(Nᵖ), where p is a "measure of ambiguity" of the pattern. To get O(pN), we need to avoid evaluating equivalent threads again and again.

    ...

    Solution: advance all threads by one character at each step. This "breadth-first" execution results in linear complexity (see the sketch after this list).

  • Tricks to save every bit of performance

  • Inside std::regex
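Here is a minimal sketch of that breadth-first execution (a "Pike VM") for a toy instruction set, just to illustrate the linear bound; the opcodes and the example program encoding are illustrative:

```python
# Breadth-first ("Pike VM") sketch: all threads are advanced one input
# character at a time, and a thread is added at most once per program counter
# and step, so the work per character is bounded by the program size and the
# total time is O(len(prog) * len(text)).
CHAR, SPLIT, JMP, MATCH = range(4)

def add_thread(prog, threads, seen, pc):
    if pc in seen:
        return                              # equivalent thread already queued
    seen.add(pc)
    op = prog[pc]
    if op[0] == JMP:
        add_thread(prog, threads, seen, op[1])
    elif op[0] == SPLIT:
        add_thread(prog, threads, seen, op[1])
        add_thread(prog, threads, seen, op[2])
    else:
        threads.append(pc)                  # CHAR or MATCH: a runnable thread

def match(prog, text):
    threads, seen = [], set()
    add_thread(prog, threads, seen, 0)
    for ch in text:
        next_threads, next_seen = [], set()
        for pc in threads:
            op = prog[pc]
            if op[0] == CHAR and op[1] == ch:
                add_thread(prog, next_threads, next_seen, pc + 1)
        threads = next_threads
    return any(prog[pc][0] == MATCH for pc in threads)

# (a|a)*b -- exponential for a naive backtracker on long runs of 'a' followed
# by a non-matching character, but linear with breadth-first execution:
prog = [(SPLIT, 1, 6), (SPLIT, 2, 4), (CHAR, 'a'), (JMP, 5),
        (CHAR, 'a'), (JMP, 0), (CHAR, 'b'), (MATCH,)]
print(match(prog, "a" * 30 + "b"))          # True
print(match(prog, "a" * 30 + "c"))          # False
```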

Hope this helps!

P.S. The lecturer's repository

answered Nov 25 '22 by nonForgivingJesus