How do backreferences in regexes make backtracking required?

I read http://swtch.com/~rsc/regexp/regexp1.html, in which the author says that supporting backreferences in regexes requires backtracking when matching, and that this makes the worst-case complexity exponential. But I don't see exactly why backreferences introduce the need for backtracking. Can someone explain why, and perhaps provide an example (regex and input)?

asked Jun 19 '12 by oskarkv

People also ask

What is backtracking in regex?

Backtracking occurs when a regular expression pattern contains optional quantifiers or alternation constructs, and the regular expression engine returns to a previous saved state to continue its search for a match.

How does regex replace work?

The REGEXREPLACE( ) function uses a regular expression to find matching patterns in data and replaces any matching values with a new string. For example, it can standardize spacing in character data by replacing one or more spaces between text characters with a single space.

What does backslash 1 mean in regex?

The backreference \1 (backslash one) refers to the first capturing group: it matches the exact same text that was matched by that group.

What is catastrophic backtracking?

Catastrophic backtracking is a condition that can occur when you check a (usually long) string against a complex regular expression. The problem usually arises when something toward the end of the string causes the match to fail, forcing the engine to retry a huge number of alternative ways of matching the earlier part.


1 Answer

To get directly at your question, you should make a short study of the Chomsky Hierarchy. This is an old and beautiful way of organizing formal languages into sets of increasing complexity. The lowest rung of the hierarchy is the Regular Languages (RLs). You might guess - and you'd be right - that the RLs are exactly those that can be represented with "pure" regular expressions: those with only the alphabet, the empty string, concatenation, alternation |, and the Kleene star * (look Ma, no back references). A classic theorem of formal language theory - Kleene's Theorem - says that DFAs, NFAs (as described in the article you cited), and regular expressions all have exactly the same power to represent and recognize languages. Thompson's construction given in the article is part of the theorem's proof.
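To make "exactly the same power" concrete, here's a minimal sketch (my own illustration, not from the article) of the set-of-states NFA simulation behind Kleene's theorem: by tracking every state the NFA could be in at once, a pure regex is matched in time linear in the input, with no backtracking. The hand-built table below is an NFA for the pure regex (a|b)*abb.

# Hand-built NFA for the pure regex (a|b)*abb (illustration only).
# transitions[(state, symbol)] is the set of possible next states.
transitions = {
    (0, 'a'): {0, 1},   # state 0 loops on a and b; on 'a' it may also advance
    (0, 'b'): {0},
    (1, 'b'): {2},
    (2, 'b'): {3},      # state 3 is the lone accepting state
}
START, ACCEPT = 0, 3

def matches(text):
    states = {START}                # every state the NFA could occupy
    for ch in text:
        states = {t for s in states
                  for t in transitions.get((s, ch), set())}
        if not states:              # no live states left: fail early
            return False
    return ACCEPT in states

print(matches("ababb"))  # True
print(matches("abab"))   # False

Notice there is never any backing up: each input character is examined exactly once.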

Every RL is also a context-free language (CFL), but there are infinitely many CFLs that aren't regular. A feature that can exist in CFLs and make them too complex to be regular is balanced pairs of things: parentheses, begin-end blocks, etc. Nearly all programming languages are CFLs. CFLs can be efficiently recognized by what's called a pushdown automaton. This is essentially an NFA with a stack glued on. The stack grows as big as needed, so it's no longer a finite automaton. Parsers of real programming languages are nearly all variations on pushdown automata.
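Here is the pushdown idea in the same sketchy spirit (mine, not part of the original answer): balanced parentheses, the classic non-regular CFL, need nothing more than "push on '(' and pop on ')'".

def balanced(text):
    stack = []
    for ch in text:
        if ch == '(':
            stack.append(ch)   # push: remember one unmatched '('
        elif ch == ')':
            if not stack:      # pop from an empty stack: mismatch
                return False
            stack.pop()
    return not stack           # accept only if every '(' was closed

print(balanced("(()())"))  # True
print(balanced("(()"))     # False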

Consider the regex with a backreference

^(b*a)\1$ 

In words, this represents strings of length 2n for some n, where both the n'th and 2n'th characters are a and all other characters are b. This is a perfect example of a CFL that's not regular. You can rigorously prove this with another cool formal-language tool called the pumping lemma.
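As a quick demonstration (my own example strings, not the answer's), Python's re module happily accepts this pattern, precisely because its engine is a backtracking matcher:

import re

pattern = re.compile(r'^(b*a)\1$')

print(bool(pattern.match('bbabba')))   # True:  n=3, characters 3 and 6 are 'a'
print(bool(pattern.match('aa')))       # True:  n=1
print(bool(pattern.match('bbabbab')))  # False: no valid split into two halves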

This is exactly why back references cause problems! They allow "regular expressions" that represent languages that aren't regular. Therefore there is no NFA or DFA that can ever recognize them.

But wait, it's even worse than I've made it out to be so far. Consider

^(b*a)\1\1$ 

We now have strings of length 3n where the n'th, 2n'th, and 3n'th characters are a and all others are b. Another flavor of the pumping lemma (the one for context-free languages) allows a proof that this language is too complex even to be a CFL. No pushdown automaton can recognize this one.
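Continuing the demonstration above (again with my own strings), a backtracking engine copes even with this context-sensitive pattern, at the price of searching:

import re

csl = re.compile(r'^(b*a)\1\1$')

print(bool(csl.match('bbabbabba')))  # True:  n=3, length 9
print(bool(csl.match('bbabbabbb')))  # False: the third block doesn't match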

Back references allow these supercharged regexes to represent languages up on the third rung of the Chomsky Hierarchy: the Context Sensitive Languages. Roughly speaking, the only general way to recognize a CSL is to search through all the strings of the same length in the language (at least if P != NP, but that's true for all practical purposes and a different story altogether). The number of such strings is exponential in the length of the one you're matching.

This is why a searching (backtracking) regex matcher is needed. You can be very clever in the way you design the search, but there will always be some input that drives it to take exponential time.

So I agree with the author of the paper you cited. It's possible to write perfectly innocent-looking regexes with no back refs that will be efficiently recognized for nearly all inputs, but where there exists some input that causes a Perl or Java or Python regex matcher - because it is a backtracking search - to require millions of years to complete the match. This is crazy. You can have a script that's correct and works fine for years and then locks up one day merely because it stumbled onto one of the bad inputs. Suppose the regex is buried in the message parser of the navigation system of the airplane you're riding...
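Here is the standard demonstration of that failure mode (a textbook example, not from the original post). There are no back references at all, but the nested quantifiers give a backtracking engine exponentially many ways to carve the run of a's into groups once the trailing b forces the overall match to fail. Each extra a roughly doubles the running time:

import re
import time

evil = re.compile(r'^(a+)+$')   # nested quantifiers: the troublemaker
victim = 'a' * 24 + 'b'         # the trailing 'b' guarantees failure

start = time.perf_counter()
print(evil.match(victim))       # None, after millions of failed attempts
print(f'{time.perf_counter() - start:.1f} seconds')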

Edit

By request, I'll sketch how the pumping lemma (PL) can be used to prove the language a^k b a^k b is not regular. Here a^k is shorthand for a repeated k times. The PL says that for every regular language there exists a positive integer N such that every string in the language of length at least N can be written as R S T, where S is not empty, R S falls within the first N characters (|R S| <= N), and R S^j T is also in the language for every natural number j. Here R, S, T are strings.

The proof of the PL depends on the fact that every regular language corresponds to some DFA. An accepted input to this DFA longer than its number of states (that number of states is the N in the lemma) must cause it to "loop": to repeat a state. Call this state X. The machine consumes some string R to get from the start to X, then S to loop from X back to X, then T to get to an accepting state. Adding extra copies of S to the input (or else deleting S) corresponds only to a different number of "loops" from X back to X. Consequently, the new string with additional (or deleted) copies of S will also be accepted.

Since every RL must satisfy the PL, a proof that a language is not regular proceeds by showing that it contradicts the PL. For our language, this is not hard. Suppose you are trying to convince me that the language L = a^k b a^k b satisfies the PL. Because it does, you must be able to give me some value of N (see above): the number of states in a hypothetical DFA that recognizes L. At that point, I will say, "Okay Mr. Regular Guy, consider the string B = a^N b a^N b." If L is regular, B must cause this DFA (no matter what it looks like) to loop within the first N characters, which are all a's! So the loop (the string S above) consists entirely of a's, too. With this I can immediately show that your claim about L being regular is false: I just choose to go around the loop a second time. This causes your hypothetical DFA to accept a new string a^M b a^N b, where M > N because I added a's to the first half. Ouch! This new string is not in L, so the PL does not hold after all. Since I can play this trick no matter what N you provide, the PL cannot hold for L, and L cannot be regular after all.
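For the concretely minded, here is the same argument as a small sketch (my illustration; N = 5 stands in for the hypothetical DFA's state count):

def in_language(s):
    # Membership test for L = { a^k b a^k b : k >= 1 }.
    parts = s.split('b')
    return (len(parts) == 3 and parts[2] == ''   # exactly two b's, one at the end
            and parts[0] == parts[1]             # equal runs of a's
            and len(parts[0]) >= 1
            and set(parts[0]) <= {'a'})          # nothing but a's in the runs

N = 5
w = 'a' * N + 'b' + 'a' * N + 'b'
print(in_language(w))      # True: a^5 b a^5 b is in L

# Any loop S inside the first N characters is a nonempty run of a's,
# so going around it one extra time pumps only the first half:
S = 'aa'                   # one hypothetical choice of loop
print(in_language(S + w))  # False: a^7 b a^5 b has fallen out of L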

Since it's not regular, Kleene's theorem tells us there is no DFA, NFA, nor "pure" regex that describes it.

The proof that back refs allow languages that aren't even context free has a very similar ring but needs background on pushdown automata that I'm not going to give here. Google will provide.

NB: Both of these fall short of proving that back refs make recognition NP-complete. They merely say in a very rigorous way that back refs add real complexity to pure regular expressions. They allow languages that can't be recognized by any machine with finite memory, nor by any machine with only an unbounded LIFO memory. I'll leave the NP-completeness proof to others.

answered Nov 08 '22 by Gene