Article Details of Regular Expression Behavior from MSDN says, that .NET devs decide to use for regular expressions traditional NFA engine, because it is faster than POSIX NFA, but it is not clear to me, why does this pattern works exponentially slow then?
var regex = new Regex("(a|aa)*b");
var b = regex.IsMatch("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac");
This simple pattern matching take more than 30 minute to execute. But if .NET uses traditional NFA, it is possible to simulate it and find match in O(M*N) time in the worst case, where M is pattern length and N is text length, which surely is not true in this case.
Article also explains that backtracking is the reason of slow execution, but I still have some questions that can't find answers
(a|b)c/1
?NET, regular expression patterns are defined by a special syntax or language, which is compatible with Perl 5 regular expressions and adds some additional features such as right-to-left matching. For more information, see Regular Expression Language - Quick Reference.
In C#, Regular Expression is a pattern which is used to parse and check whether the given input text is matching with the given pattern or not. In C#, Regular Expressions are generally termed as C# Regex. The . Net Framework provides a regular expression engine that allows the pattern matching.
R supports two regular expression flavors: POSIX 1003.2 and Perl.
A regex engine executes the regex one character at a time in left-to-right order. This input string itself is parsed one character at a time, in left-to-right order. Once a character is matched, it's said to be consumed from the input, and the engine moves to the next input character.
You can compile a regular expression to a NFA or a DFA, although the DFA calculated from an NFA may be impractically large. You can match a NFA with or without backtracking, although the schemes that work without back-tracking usually put more constraints on the regular expression language, and on which matches are found when there are many possible matches.
Your example is slow because the matcher has to decide very often whether to match with a or aa, and whether to try matching the final b. Backtracking works like running a recursive function which, at each step, makes recursive calls on itself for each possibility - recursively match with a and if that fails recursively match with aa and if that fails recursively match with b.
Microsoft seem to be saying that their sort of backtracking is faster than POSIX because POSIX backtracking will arrange for a recursive search that carries on until it is sure that it has found the longest possible match. The Microsoft version still backtracks, but it does not have extra checks that carry on until there is a guarantee that they have found the longest possible match. There is an example in http://msdn.microsoft.com/en-us/library/dsy130b4%28v=vs.110%29.aspx.
Regular expression matchers without backtracking can work by accepting input one character at a time, and keeping track of which states in the NFA are live at that time - there may be many such states. It is hard to make this work with back-references, because then the state of the NFA cannot be described by just saying whether a state is live or not.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With