I am looking for a non-technical explanation of the difference between DFA vs NFA engines, based on their capabilities and limitations.

A simple, nontechnical explanation, paraphrased from Jeffrey Friedl's book Mastering Regular Expressions. CAVEAT: While this book is generally considered the "regex bible", there appears some controversy as to whether the distinction made here between DFA and NFA is actually correct. I'm not a computer scientist, and I don't understand most of the theory behind what really is a "regular" expression, deterministic or not. After the controversy started, I deleted this answer because of this, but since then it has been referenced in comments to other answers. I would be very interested in discussing this further - can it be that Friedl really is wrong? Or did I get Friedl wrong (but I reread that chapter yesterday evening, and it's just like I remembered...)? Edit: It appears that Friedl and I are indeed wrong. Please check out Eamon's excellent comments below. <hr> Original answer: A DFA engine steps through the input string character by character and tries (and remembers) all possible ways the regex could match at this point. If it reaches the end of the string, it declares success. Imagine the string <code>AAB</code> and the regex <code>A*AB</code>. We now step through our string letter by letter. <ol> <li> <code>A</code>: <ul> <li>First branch: Can be matched by <code>A*</code>.</li> <li>Second branch: Can be matched by ignoring the <code>A*</code> (zero repetitions are allowed) and using the second <code>A</code> in the regex.</li> </ul> </li> <li> <code>A</code>: <ul> <li>First branch: Can be matched by expanding <code>A*</code>.</li> <li>Second branch: Can't be matched by <code>B</code>. Second branch fails. But:</li> <li>Third branch: Can be matched by not expanding <code>A*</code> and using the second <code>A</code> instead.</li> </ul> </li> <li> <code>B</code>: <ul> <li>First branch: Can't be matched by expanding <code>A*</code> or by moving on in the regex to the next token <code>A</code>. First branch fails.</li> <li>Third branch: Can be matched. Hooray!</li> </ul> </li> </ol> A DFA engine never backtracks in the string. <hr> An NFA engine steps through the regex token by token and tries all possible permutations on the string, backtracking if necessary. If it reaches the end of the regex, it declares success. Imagine the same string and the same regex as before. We now step through our regex token by token: <ol> <li> <code>A*</code>: Match <code>AA</code>. Remember the backtracking positions 0 (start of string) and 1.</li> <li> <code>A</code>: Doesn't match. But we have a backtracking position we can return to and try again. The regex engine steps back one character. Now <code>A</code> matches. </li> <li> <code>B</code>: Matches. End of regex reached (with one backtracking position to spare). Hooray!</li> </ol>

DFA vs NFA engines: What is the difference in their capabilities and limitations?

3 Answers

Deterministic Finite Automatons (DFAs) and Nondeterministic Finite Automatons (NFAs) have exactly the same capabilities and limitations. The only difference is notational convenience.

A finite automaton is a processor that has states and reads input, each input character potentially setting it into another state. For example, a state might be "just read two Cs in a row" or "am starting a word". These are usually used for quick scans of text to find patterns, such as lexical scanning of source code to turn it into tokens.

A deterministic finite automaton is in one state at a time, which is implementable. A nondeterministic finite automaton can be in more than one state at a time: for example, in a language where identifiers can begin with a digit, there might be a state "reading a number" and another state "reading an identifier", and an NFA could be in both at the same time when reading something starting "123". Which state actually applies would depend on whether it encountered something not numeric before the end of the word.

Now, we can express "reading a number or identifier" as a state itself, and suddenly we don't need the NFA. If we express combinations of states in an NFA as states themselves, we've got a DFA with a lot more states than the NFA, but which does the same thing.

It's a matter of which is easier to read or write or deal with. DFAs are easier to understand per se, but NFAs are generally smaller.

151

answered Sep 20 '22 14:09

David Thornley

Here's a non-technical answer from Microsoft:

DFA engines run in linear time because they do not require backtracking (and thus they never test the same character twice). They can also guarantee matching the longest possible string. However, since a DFA engine contains only finite state, it cannot match a pattern with backreferences, and because it does not construct an explicit expansion, it cannot capture subexpressions.

Traditional NFA engines run so-called "greedy" match backtracking algorithms, testing all possible expansions of a regular expression in a specific order and accepting the first match. Because a traditional NFA constructs a specific expansion of the regular expression for a successful match, it can capture subexpression matches and matching backreferences. However, because a traditional NFA backtracks, it can visit exactly the same state multiple times if the state is arrived at over different paths. As a result, it can run exponentially slowly in the worst case. Because a traditional NFA accepts the first match it finds, it can also leave other (possibly longer) matches undiscovered.

POSIX NFA engines are like traditional NFA engines, except that they continue to backtrack until they can guarantee that they have found the longest match possible. As a result, a POSIX NFA engine is slower than a traditional NFA engine, and when using a POSIX NFA you cannot favor a shorter match over a longer one by changing the order of the backtracking search.

Traditional NFA engines are favored by programmers because they are more expressive than either DFA or POSIX NFA engines. Although in the worst case they can run slowly, you can steer them to find matches in linear or polynomial time using patterns that reduce ambiguities and limit backtracking.

[http://msdn.microsoft.com/en-us/library/0yzc2yb0.aspx]

answered Sep 19 '22 14:09

james.garriss

A simple, nontechnical explanation, paraphrased from Jeffrey Friedl's book Mastering Regular Expressions.

CAVEAT:

While this book is generally considered the "regex bible", there appears some controversy as to whether the distinction made here between DFA and NFA is actually correct. I'm not a computer scientist, and I don't understand most of the theory behind what really is a "regular" expression, deterministic or not. After the controversy started, I deleted this answer because of this, but since then it has been referenced in comments to other answers. I would be very interested in discussing this further - can it be that Friedl really is wrong? Or did I get Friedl wrong (but I reread that chapter yesterday evening, and it's just like I remembered...)?

Edit: It appears that Friedl and I are indeed wrong. Please check out Eamon's excellent comments below.

Original answer:

A DFA engine steps through the input string character by character and tries (and remembers) all possible ways the regex could match at this point. If it reaches the end of the string, it declares success.

Imagine the string AAB and the regex A*AB. We now step through our string letter by letter.

A:
- First branch: Can be matched by A*.
- Second branch: Can be matched by ignoring the A* (zero repetitions are allowed) and using the second A in the regex.
A:
- First branch: Can be matched by expanding A*.
- Second branch: Can't be matched by B. Second branch fails. But:
- Third branch: Can be matched by not expanding A* and using the second A instead.
B:
- First branch: Can't be matched by expanding A* or by moving on in the regex to the next token A. First branch fails.
- Third branch: Can be matched. Hooray!

A DFA engine never backtracks in the string.

An NFA engine steps through the regex token by token and tries all possible permutations on the string, backtracking if necessary. If it reaches the end of the regex, it declares success.

Imagine the same string and the same regex as before. We now step through our regex token by token:

A*: Match AA. Remember the backtracking positions 0 (start of string) and 1.
A: Doesn't match. But we have a backtracking position we can return to and try again. The regex engine steps back one character. Now A matches.
B: Matches. End of regex reached (with one backtracking position to spare). Hooray!

answered Sep 19 '22 14:09

Tim Pietzcker

Related questions
                            
                                Remove first character from a string if it is a comma
                            
                                Regular expression for matching time in military (24 hour) format
                            
                                How to remove accents and turn letters into "plain" ASCII characters? [duplicate]
                            
                                Best Regular Expression for Email Validation in C#
                            
                                Remove all whitespace from C# string with regex
                            
                                RegExp to strip HTML comments
                            
                                United States Banking Institution Account Number Regular Expression?
                            
                                What's the use of the [\b] backspace regex?
                            
                                Why are regular expressions called "regular" expressions?
                            
                                Template literal inside of the RegEx
                            
                                Dedicated Regular Expression for Persian alphabet [duplicate]
                            
                                Regular expression to validate US phone numbers? [duplicate]
                            
                                Escaped Periods In R Regular Expressions
                            
                                How to replace the first occurrence of a regular expression in Python?
                            
                                Python: Replace with regex
                            
                                How do I strip non-alphanumeric characters (including spaces) from a string?
                            
                                What does '\K' mean in this regex?
                            
                                Removing everything except numbers in a string
                            
                                Match a string against multiple patterns
                            
                                sed - how to do regex groups using sed

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

DFA vs NFA engines: What is the difference in their capabilities and limitations?

Tags:

regex

finite-automata

dfa

nfa

blunders

People also ask

3 Answers

David Thornley

james.garriss

Tim Pietzcker

Recent Activity

Donate For Us