Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does .NET really use NFA for regular expression engine?

Article Details of Regular Expression Behavior from MSDN says, that .NET devs decide to use for regular expressions traditional NFA engine, because it is faster than POSIX NFA, but it is not clear to me, why does this pattern works exponentially slow then?

var regex = new Regex("(a|aa)*b");
var b = regex.IsMatch("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac");

This simple pattern matching take more than 30 minute to execute. But if .NET uses traditional NFA, it is possible to simulate it and find match in O(M*N) time in the worst case, where M is pattern length and N is text length, which surely is not true in this case.

Article also explains that backtracking is the reason of slow execution, but I still have some questions that can't find answers

  1. What is backtracking? is it only using already matched pattern again like this (a|b)c/1?
  2. Does traditional NFA support backtracking, if no what modification does it need?
  3. If NFA supports it, but need more slower algorithm to simulate, why .NET can't check if backtracking exist in the pattern, and use one algorithm and use another if it doesn't?
like image 534
Arsen Mkrtchyan Avatar asked Dec 12 '13 09:12

Arsen Mkrtchyan


People also ask

What regex engine does .NET use?

NET, regular expression patterns are defined by a special syntax or language, which is compatible with Perl 5 regular expressions and adds some additional features such as right-to-left matching. For more information, see Regular Expression Language - Quick Reference.

What regex does C# use?

In C#, Regular Expression is a pattern which is used to parse and check whether the given input text is matching with the given pattern or not. In C#, Regular Expressions are generally termed as C# Regex. The . Net Framework provides a regular expression engine that allows the pattern matching.

What regex flavor does r use?

R supports two regular expression flavors: POSIX 1003.2 and Perl.

How does a regex engine work?

A regex engine executes the regex one character at a time in left-to-right order. This input string itself is parsed one character at a time, in left-to-right order. Once a character is matched, it's said to be consumed from the input, and the engine moves to the next input character.


1 Answers

You can compile a regular expression to a NFA or a DFA, although the DFA calculated from an NFA may be impractically large. You can match a NFA with or without backtracking, although the schemes that work without back-tracking usually put more constraints on the regular expression language, and on which matches are found when there are many possible matches.

Your example is slow because the matcher has to decide very often whether to match with a or aa, and whether to try matching the final b. Backtracking works like running a recursive function which, at each step, makes recursive calls on itself for each possibility - recursively match with a and if that fails recursively match with aa and if that fails recursively match with b.

Microsoft seem to be saying that their sort of backtracking is faster than POSIX because POSIX backtracking will arrange for a recursive search that carries on until it is sure that it has found the longest possible match. The Microsoft version still backtracks, but it does not have extra checks that carry on until there is a guarantee that they have found the longest possible match. There is an example in http://msdn.microsoft.com/en-us/library/dsy130b4%28v=vs.110%29.aspx.

Regular expression matchers without backtracking can work by accepting input one character at a time, and keeping track of which states in the NFA are live at that time - there may be many such states. It is hard to make this work with back-references, because then the state of the NFA cannot be described by just saying whether a state is live or not.

like image 139
mcdowella Avatar answered Oct 11 '22 20:10

mcdowella