I'm learning both compiler theory (where a regular expression can always be matched in O(n) time) and the general-purpose regular expressions used in programs. I notice that some regexes can cause catastrophic backtracking, which seems to conflict with the theory.
Theoretically, ^(?:a+)+$ can be turned into an NFA like this:

[NFA diagram omitted]

By the standard subset construction, that NFA can be converted into a DFA with exactly the same states as the DFA for a+. But in real life, this pattern causes catastrophic backtracking on inputs like aaaaaab. Why can't the regex be compiled into an efficient DFA? More generally, a regex composed only of |, *, +, and (?:) should be equivalent to some non-backtracking DFA, which is at most O(n^2) even if we restart the match from every possible position and fail every time, so why can matching take exponential time? Is there a difference between the regular expressions of compiler theory and the regexes used in real programs?
There are DFA regex engines and there are backtracking regex engines. Generally speaking, a language or library either decides that it wants features that can only be supported with a backtracking engine (like backreferences or arbitrary lookaround), in which case it compiles to that sort of structure — or it decides that it can do without such features and compiles to a DFA.
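As a concrete illustration of the two families, the sketch below assumes the third-party google-re2 binding is installed (pip install google-re2, imported as re2); it exposes an re-like API on top of an automaton-based engine:

```python
import re    # backtracking engine (CPython's built-in)
import re2   # RE2: automaton-based, never backtracks (assumed installed)

s = 'a' * 30 + 'b'
print(re2.match(r'^(?:a+)+$', s))   # None, returned almost instantly
# re.match(r'^(?:a+)+$', s)         # the same call here would run for minutes

# The trade-off: features that require backtracking are rejected
# at compile time rather than supported.
try:
    re2.compile(r'(a)\1')           # backreference
except re2.error as e:
    print('re2 rejects backreferences:', e)
```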
Although it would be possible to choose the matching strategy on a pattern-by-pattern basis, or to apply both strategies (using a DFA to weed out definite non-matches before feeding the remainder to a backtracking engine), for reasons of complexity this is pretty much never done.
So, a DFA engine can't "catastrophically backtrack" on your pattern. A backtracking one might or might not. It's entirely possible for it to analyze the pattern, recognize that ^(?:a+)+$ can only match if a$ can (the string must end in an a), and fail the match early, but that's a game of halting-problem whack-a-mole: how many cases do you cover, how much code do you write, and how much work do you spend pre-processing in order to maybe short-circuit a brute-force match?
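To see why the automaton approach is immune, here is a minimal sketch of NFA simulation for this pattern. The transition table is hand-built and the state numbering is my own (any correct Thompson-style construction would do); the key property is that the simulation advances a set of states one input character at a time, so nested quantifiers cannot multiply the work:

```python
from collections import defaultdict

# Hand-built NFA for the anchored pattern (?:a+)+ :
# state 0 --a--> state 1, plus an epsilon edge 1 -> 0 that serves as
# the repeat edge for both the inner and the outer +. Accept state: 1.
EPSILON = None
edges = defaultdict(list)
edges[(0, 'a')].append(1)
edges[(1, EPSILON)].append(0)
start, accepting = 0, {1}

def eps_closure(states):
    """Every state reachable from `states` via epsilon edges."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in edges[(s, EPSILON)]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def match(text):
    """Simulate the NFA in O(len(text) * number_of_states).
    No backtracking: we track the set of all states the NFA
    could be in after each character."""
    current = eps_closure({start})
    for ch in text:
        current = eps_closure({t for s in current for t in edges[(s, ch)]})
        if not current:      # no live states left: reject immediately
            return False
    return bool(current & accepting)

print(match('a' * 5000))          # True, instantly
print(match('a' * 5000 + 'b'))    # False, instantly: no exponential blow-up
```

A DFA engine goes one step further and precomputes these state sets ahead of time, so the inner loop becomes a single table lookup per character.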