<p>I'm debugging a Regular Expression <code>^(A+)*B</code> over a string <code>AAAC</code> (example from rexegg.com) by two separate debugging tools I have access:</p> <ol> <li>regex101.com</li> <li>RegexBuddy v4</li> </ol> <p>Below is the results (regex101 in the left side):</p> <p><img src="https://i.stack.imgur.com/HeofB.png" alt="tool outps (regex101 on the left)"></p> <p>The question I have is not mainly about the number of steps which is also important to me, but is how backtracks are done. Why do we see differences? (regex101 uses PCRE lib and I set RegexBuddy lib the same)</p> <p>A comprehensive step by step explanation is really in my favor.</p>

<h3>TL;DR</h3> <p>In short, "backtracking" is when a regex engine returns to a "flexible" match, attempting a different path to get a successful match.</p> <h3>Backtracking with Alternation</h3> <p>For example, in the following pattern and input:</p> <pre class="prettyprint"><code>foo(bar|baz) </code></pre> <pre class="prettyprint"><code>foobaz </code></pre> <p>The regex engine will match "foo", then attempt the first of the two options, matching "b" and then "a", but fails at "r". Rather than failing the whole match, though, it will "rewind the tape" and start with the second alternative, matching "b" then "a" and then "z"... success!</p> <h3>Backtracking with Quantifiers</h3> <p>This also works with quantifiers. A quantifier is anything that encourages the engine to match a repeating pattern, including <code>?</code>, <code>*</code>, <code>+</code> and <code>{n,m}</code> (depending on the engine).</p> <p>A greedy quantifier (the default) will match as many repetitions as possible before moving on to the rest of the pattern. For example, given the pattern and input below:</p> <pre class="prettyprint"><code>\w+bar </code></pre> <pre class="prettyprint"><code>foobar </code></pre> <p>The pattern <code>\w+</code> will begin by matching the entire string: "foobar". However, when it moves on to the <code>b</code>, the regex engine has reach the end of the input and the match fails. Rather than simply giving up, the engine will ask the last greedy quantifier to give up one of its repetitions, now matching "fooba". The match still fails, so the engine asks <code>\w+</code> to give up the "a" (failure), and then the "b". After giving up the "b", the engine can now match <code>bar</code>, and the match succeeds.</p> <h3>Trees and Backtracking</h3> <p>Another way of thinking of a regex is as a "tree", and backtracking is going back up a node and trying another path. Given the pattern <code>foo(bar|baz)</code> and the input "foobaz", the engine will attempt something like the following:</p> <pre class="prettyprint"><code>foo(bar|baz) |\ | \ | : f (match) | : o (match) | : o (match) | | (bar|baz) | |\ | | \ | | : b (match) | | : a (match) | | : r (FAIL: go back up a level) | | ^ | |\ | | \ | | : b (match) | | : a (match) | | : z (match) | | / | |/ | / |/ SUCCESS </code></pre> <h3>Counting the "Backtracks"</h3> <p>As to why you see differences in the "number" of backtracks... this probably has a lot to do with internal optimizations and logging level. For example, RegexBuddy does not appear to be logging the match to the empty string before <code>^</code>, while regex101 does. In the end, though, it doesn't really matter what exact order you backtrack in (what order you climb back up the tree) so long as you end up with the same result.</p> <h3>Evil Regexes</h3> <p>You already know this, but for the benefit of anyone else who happens by, your regex was written to demonstrate "catastrophic backtracking" (aka "evil regex"), where the number of backtrack attempts grows exponentially as the length of the input increases. These regexes can be exploited to perform DoS attacks, so you must use caution not to introduce these into your patterns (as I found out).</p>

RegEx debugging

Tags:

I'm debugging a Regular Expression ^(A+)*B over a string AAAC (example from rexegg.com) by two separate debugging tools I have access:

regex101.com
RegexBuddy v4

Below is the results (regex101 in the left side):

tool outps (regex101 on the left)

The question I have is not mainly about the number of steps which is also important to me, but is how backtracks are done. Why do we see differences? (regex101 uses PCRE lib and I set RegexBuddy lib the same)

A comprehensive step by step explanation is really in my favor.

777

asked Jun 19 '16 21:06

revo

1 Answers

TL;DR

In short, "backtracking" is when a regex engine returns to a "flexible" match, attempting a different path to get a successful match.

Backtracking with Alternation

For example, in the following pattern and input:

foo(bar|baz)

foobaz

The regex engine will match "foo", then attempt the first of the two options, matching "b" and then "a", but fails at "r". Rather than failing the whole match, though, it will "rewind the tape" and start with the second alternative, matching "b" then "a" and then "z"... success!

Backtracking with Quantifiers

This also works with quantifiers. A quantifier is anything that encourages the engine to match a repeating pattern, including ?, *, + and {n,m} (depending on the engine).

A greedy quantifier (the default) will match as many repetitions as possible before moving on to the rest of the pattern. For example, given the pattern and input below:

\w+bar

foobar

The pattern \w+ will begin by matching the entire string: "foobar". However, when it moves on to the b, the regex engine has reach the end of the input and the match fails. Rather than simply giving up, the engine will ask the last greedy quantifier to give up one of its repetitions, now matching "fooba". The match still fails, so the engine asks \w+ to give up the "a" (failure), and then the "b". After giving up the "b", the engine can now match bar, and the match succeeds.

Trees and Backtracking

Another way of thinking of a regex is as a "tree", and backtracking is going back up a node and trying another path. Given the pattern foo(bar|baz) and the input "foobaz", the engine will attempt something like the following:

foo(bar|baz)
|\
| \
|  : f (match)
|  : o (match)
|  : o (match)
|  | (bar|baz)
|  |\
|  | \
|  |  : b (match)
|  |  : a (match)
|  |  : r (FAIL: go back up a level)
|  |  ^
|  |\
|  | \
|  |  : b (match)
|  |  : a (match)
|  |  : z (match)
|  | /
|  |/
| /
|/
SUCCESS

Counting the "Backtracks"

As to why you see differences in the "number" of backtracks... this probably has a lot to do with internal optimizations and logging level. For example, RegexBuddy does not appear to be logging the match to the empty string before ^, while regex101 does. In the end, though, it doesn't really matter what exact order you backtrack in (what order you climb back up the tree) so long as you end up with the same result.

Evil Regexes

You already know this, but for the benefit of anyone else who happens by, your regex was written to demonstrate "catastrophic backtracking" (aka "evil regex"), where the number of backtrack attempts grows exponentially as the length of the input increases. These regexes can be exploited to perform DoS attacks, so you must use caution not to introduce these into your patterns (as I found out).

166

answered Nov 11 '22 05:11

JDB

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

RegEx debugging

Tags:

revo

People also ask

1 Answers

TL;DR

Backtracking with Alternation

Backtracking with Quantifiers

Trees and Backtracking

Counting the "Backtracks"

Evil Regexes

JDB

Recent Activity

Donate For Us

RegEx debugging

Tags:

revo

People also ask

1 Answers

TL;DR

Backtracking with Alternation

Backtracking with Quantifiers

Trees and Backtracking

Counting the "Backtracks"

Evil Regexes

JDB

Related questions

Recent Activity

Donate For Us