Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does this backreference not work inside a lookbehind?

Tags:

Matching a repeated character in regex is simple with a backreference:

(.)\1 

Test it here.

However, I would like to match the character after the pair of characters, so I thought I could simply put this in a lookbehind:

(?<=(.)\1). 

Unfortunately, this doesn't match anything.

Why is that? In other flavours I wouldn't be surprised because there are strong restrictions on lookbehinds, but .NET usually supports arbitrarily complicated patterns inside lookbehinds.

like image 668
Martin Ender Avatar asked Mar 16 '16 22:03

Martin Ender


People also ask

How does regex Lookbehind work?

Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (? <!a)b matches a “b” that is not preceded by an “a”, using negative lookbehind.

Does JavaScript regex support Lookbehind?

JavaScript doesn't support any lookbehind, but it can support lookaheads.

What is Lookbehind assertion?

Regex Lookbehind is used as an assertion in Python regular expressions(re) to determine success or failure whether the pattern is behind i.e to the right of the parser's current position. They don't match anything. Hence, Regex Lookbehind and lookahead are termed as a zero-width assertion.

What is lookahead and Lookbehind?

Lookahead allows to add a condition for “what follows”. Lookbehind is similar, but it looks behind. That is, it allows to match a pattern only if there's something before it.


1 Answers

The short version: Lookbehinds are matched from right to left. That means when the regex engine encounters the \1 it hasn't captured anything into that group yet, so the regex always fails. The solution is quite simple:

(?<=\1(.)). 

Test it here.

Unfortunately, the full story once you start using more complex patterns is a lot more subtle. So here is...

A guide to reading regular expressions in .NET

First, some important acknowledgements. The person who taught me that lookbehinds are matched from right to left (and figured this out on his own through a lot of experimentation), was Kobi in this answer. Unfortunately, the question I asked back then was a very convoluted example which doesn't make for a great reference for such a simple problem. So we figured it would make sense to make a new and more canonical post for future reference and as a suitable dupe target. But please consider giving Kobi an upvote for figuring out a very important aspect of .NET's regex engine that is virtually undocumented (as far as I know, MSDN mentions it in a single sentence on a non-obvious page).

Note that rexegg.com explains the inner workings of .NET's lookbehinds differently (in terms of reversing the string, the regex, and any potential captures). Although that wouldn't make a difference to the result of the match, I find that approach much harder to reason about, and from looking at the code it's fairly clear that this is not what the implementation actually does.

So. The first question is, why is it actually more subtle than the bolded sentence above. Let's try matching a character that is preceded by either a or A using a local case-insensitive modifier. Given the right-to-left matching behaviour, one might expect this to work:

(?<=a(?i)). 

However, as you can see here this doesn't seem to use the modifier at all. Indeed, if we put the modifier in front:

(?<=(?i)a). 

...it works.

Another example, that might be surprising with right-to-left matching in mind is the following:

(?<=\2(.)(.)). 

Does the \2 refer to the left or right capturing group? It refers to the right one, as this example shows.

A final example: when matched against abc, does this capture b or ab?

(?<=(b|a.))c 

It captures b. (You can see the captures on the "Table" tab.) Once again "lookbehinds are applied from right to left" isn't the full story.

Hence, this post tries to be a comprehensive reference on all things regarding directionality of regex in .NET, as I'm not aware of any such resource. The trick to reading a complicated regex in .NET is doing so in three or four passes. All but the last pass are left-to-right, regardless of lookbehinds or RegexOptions.RightToLeft. I believe this is the case, because .NET processes these when parsing and compiling the regex.

First pass: inline modifiers

This is basically what the above example shows. If anywhere in your regex, you had this snippet:

...a(b(?i)c)d... 

Regardless of where in the pattern that is or whether you're using the RTL option, c will be case-insensitive while a, b and d will not (provided they aren't affected by some other preceding or global modifier). That is probably the simplest rule.

Second pass: group numbers [unnamed groups]

For this pass you should completely ignore any named groups in the pattern, i.e. those of the form (?<a>...). Note this does not include groups with explicit numbers like (?<2>...) (which are a thing in .NET).

Capturing groups are numbered from left to right. It doesn't matter how complicated your regex is, whether you're using the RTL option or whether you nest dozens of lookbehinds and lookaheads. When you're only using unnamed capturing groups, they are numbered from left to right depending on the position of their opening parenthesis. An example:

(a)(?<=(b)(?=(.)).((c).(d)))(e) └1┘    └2┘   └3┘  │└5┘ └6┘│ └7┘                   └───4───┘ 

This gets a bit trickier when mixing unlabelled groups with explicitly numbered groups. You should still read all of these from left to right, but the rules are a bit trickier. You can determine the number of a group as follows:

  • If the group has an explicit number, its number is obviously that (and only that) number. Note that this may either add an additional capture to an already existing group number, or it may create a new group number. Also note that when you're giving explicit group numbers, they don't have to be consecutive. (?<1>.)(?<5>.) is a perfectly valid regex with group number 2 to 4 unused.
  • If the group is unlabelled, it takes the first unused number. Due to the gaps I just mentioned, this may be smaller than the maximum number that has already been used.

Here is an example (without nesting, for simplicity; remember to order them by their opening parentheses when they are nested):

(a)(?<1>b)(?<2>c)(d)(e)(?<6>f)(g)(h) └1┘└──1──┘└──2──┘└3┘└4┘└──6──┘└5┘└7┘ 

Notice how the explicit group 6 creates a gap, then the group capturing g takes that unused gap between groups 4 and 6, whereas the group capturing h takes 7 because 6 is already used. Remember that there might be named groups anywhere in between these, which we're completely ignoring for now.

If you're wondering what the purpose of repeated groups like group 1 in this example is, you might want to read about balancing groups.

Third pass: group numbers [named groups]

Of course, you can skip this pass entirely if there are no named groups in the regex.

It's a little known feature that named groups also have (implicit) group numbers in .NET, which can be used in backreferences and substitution patterns for Regex.Replace. These get their numbers in a separate pass, once all the unnamed groups have been processed. The rules for giving them numbers are as follows:

  • When a name appears for the first time, the group gets the first unused number. Again, this might be a gap in the used numbers if the regex uses explicit numbers, or it might be one greater than the greatest group number so far. This permanently associates this new number with the current name.
  • Consequently, when a name appears again in the regex, the group will have the same number that was used for that name the last time.

A more complete example with all three types of groups, explicitly showing passes two and three:

         (?<a>.)(.)(.)(?<b>.)(?<a>.)(?<5>.)(.)(?<c>.) Pass 2:  │     │└1┘└2┘│     ││     │└──5──┘└3┘│     │ Pass 3:  └──4──┘      └──6──┘└──4──┘          └──7──┘ 

Final pass: following the regex engine

Now that we know which modifiers apply to which tokens and which groups have which numbers, we finally get to the part that actually corresponds to the execution of the regex engine, and where we start going back and forth.

.NET's regex engine can process regex and string in two directions: the usual left-to-right mode (LTR) and its unique right-to-left mode (RTL). You can activate RTL mode for the entire regex with RegexOptions.RightToLeft. In that case, the engine will start trying to find a match at the end of the string and will go left through the regex and the string. For example, the simple regex

a.*b 

Would match a b, then it would try to match .* to the left of that (backtracking as necessary) such that there's an a somewhere to the left of it. Of course, in this simple example, the result between LTR and RTL mode is identical, but it helps to make a conscious effort to follow the engine in its backtracking. It can make a difference for something as simple as ungreedy modifiers. Consider the regex

a.*?b 

instead. We're trying to match axxbxxb. In LTR mode, you get the match axxb as expected, because the ungreedy quantifier is satisfied with the xx. However, in RTL mode, you'd actually match the entire string, since the first b is found at the end of the string, but then .*? needs to match all of xxbxx for a to match.

And clearly it also makes a difference for backreferences, as the example in the question and at the top of this answer shows. In LTR mode we use (.)\1 to match repeated characters and in RTL mode we use \1(.), since we need to make sure that the regex engine encounters the capture before it tries to reference it.

With that in mind, we can view lookarounds in a new light. When the regex engine encounters a lookbehind, it processes it as follows:

  • It remembers its current position x in the target string as well as its current processing direction.
  • Now it enforces RTL mode, regardless of the mode it's currently in.
  • Then the contents of the lookbehind are matched from right to left, starting from the current position x.
  • Once the lookbehind is processed completely, if it passed, the position of the regex engine resets to position x and the original processing direction is restored.

While a lookahead seems a lot more innocuous (since we almost never encounter problems like the one in the question with them), its behaviour is actually virtually the same, except that it enforces LTR mode. Of course in most patterns which are LTR only, this is never noticed. But if the regex itself is matched in RTL mode, or we're doing something as crazy as putting a lookahead inside a lookbehind, then the lookahead will change the processing direction just like the lookbehind does.

So how should you actually read a regex that does funny stuff like this? The first step is to split it into separate components, which are usually individual tokens together with their relevant quantifiers. Then depending on whether the regex is LTR or RTL, start going from top to bottom or bottom to top, respectively. Whenever you encounter a lookaround in the process, check which way its facing and skip to the correct end and read the lookaround from there. When you're done with the lookaround, continue with the surrounding pattern.

Of course there's another catch... when you encounter an alternation (..|..|..), the alternatives are always tried from left to right, even during RTL matching. Of course, within each alternative, the engine proceeds from right to left.

Here is a somewhat contrived example to show this:

.+(?=.(?<=a.+).).(?<=.(?<=b.|c.)..(?=d.|.+(?<=ab*?))). 

And here is how we can split this up. The numbers on the left show the reading order if the regex is in LTR mode. The numbers on the right show the reading order in RTL mode:

LTR             RTL   1  .+          18     (?=  2    .         14       (?<=  4      a       16  3      .+      17       )  5    .         13     )  6  .           13     (?<= 17    .         12       (?<= 14      b        9 13      .        8       | 16      c       11 15      .       10       ) 12    ..         7       (?=  7      d        2  8      .        3       |  9      .+       4         (?<= 11        a      6 10        b*?    5         )       )     ) 18  .            1 

I sincerely hope that you'll never use something as crazy as this in production code, but maybe one day a friendly colleague will leave some crazy write-only regex in your company's code base before being fired, and on that day I hope that this guide might help you figure out what the hell is going on.

Advanced section: balancing groups

For the sake of completeness, this section explains how balancing groups are affected by the directionality of the regex engine. If you don't know what balancing groups are, you can safely ignore this. If you want to know what balancing groups are, I've written about it here, and this section assumes that you know at least that much about them.

There are three types of group syntax that are relevant for balancing groups.

  1. Explicitly named or numbered groups like (?<a>...) or (?<2>...) (or even implicitly numbered groups), which we've dealt with above.
  2. Groups that pop from one of the capture stacks like (?<-a>...) and (?<-2>...). These behave as you'd expect them to. When they're encountered (in the correct processing order described above), they simply pop from the corresponding capture stack. It might be worth noting that these don't get implicit group numbers.
  3. The "proper" balancing groups (?<b-a>...) which are usually used to capture the string since the last of b. Their behaviour gets weird when mixed with right-to-left mode, and that's what this section is about.

The takeaway is, the (?<b-a>...) feature is effectively unusable with right-to-left mode. However, after a lot of experimentation, the (weird) behaviour actually appears to follow some rules, which I'm outlining here.

First, let's look at an example which shows why lookarounds complicate the situation. We're matching the string abcde...wvxyz. Consider the following regex:

(?<a>fgh).{8}(?<=(?<b-a>.{3}).{2}) 

Reading the regex in the order I presented above, we can see that:

  1. The regex captures fgh into group a.
  2. The engine then moves 8 characters to the right.
  3. The lookbehind switches to RTL mode.
  4. .{2} moves two characters to the left.
  5. Finally, (?<b-a>.{3}) is the balancing group which pops the capture off group a and pushes something onto group b. In this case, the group matches lmn and we push ijk onto group b as expected.

However, it should be clear from this example, that by changing the numerical parameters, we can change the relative position of the substrings matched by the two groups. We can even make those substrings intersect, or have one contained completely inside the other by making the 3 smaller or larger. In this case it's no longer clear what it means to push everything between the two matched substrings.

It turns out that there are three cases to distinguish.

Case 1: (?<a>...) matches left of (?<b-a>...)

This is the normal case. The top capture is popped from a and everything between the substrings matched by the two groups is pushed onto b. Consider the following two substrings for the two groups:

abcdefghijklmnopqrstuvwxyz    └──<a>──┘  └──<b-a>──┘ 

Which you might get with the regex

(?<a>d.{8}).+$(?<=(?<b-a>.{11}).) 

Then mn would be pushed onto b.

Case 2: (?<a>...) and (?<b-a>...) intersect

This includes the case where the two substrings touch, but don't contain any common characters (only a common boundary between characters). This can happen if one of the groups is inside a lookaround and the other one is not or is inside a different lookaround. In this case the intersection of both subtrings will be pushed onto b. This is still true when substring is completely contained inside the other.

Here are several examples to show this:

        Example:              Pushes onto <b>:    Possible regex:  abcdefghijklmnopqrstuvwxyz    ""                  (?<a>d.{8}).+$(?<=(?<b-a>.{11})...)    └──<a>──┘└──<b-a>──┘  abcdefghijklmnopqrstuvwxyz    "jkl"               (?<a>d.{8}).+$(?<=(?<b-a>.{11}).{6})    └──<a>┼─┘       │          └──<b-a>──┘  abcdefghijklmnopqrstuvwxyz    "klmnopq"           (?<a>k.{8})(?<=(?<b-a>.{11})..)       │   └──<a>┼─┘       └──<b-a>──┘  abcdefghijklmnopqrstuvwxyz    ""                  (?<=(?<b-a>.{7})(?<a>.{4}o))    └<b-a>┘└<a>┘  abcdefghijklmnopqrstuvwxyz    "fghijklmn"         (?<a>d.{12})(?<=(?<b-a>.{9})..)    └─┼──<a>──┼─┘      └─<b-a>─┘  abcdefghijklmnopqrstuvwxyz    "cdefg"             (?<a>c.{4})..(?<=(?<b-a>.{9})) │ └<a>┘ │ └─<b-a>─┘ 

Case 3: (?<a>...) matches right of (?<b-a>...)

This case I don't really understand and would consider a bug: when the substring matched by (?<b-a>...) is properly left of the substring matched by (?<a>...) (with at least one character between them, such that they don't share a common boundary), nothing is pushed b. By that I really mean nothing, not even an empty string — the capture stack itself remains empty. However, matching the group still succeeds, and the corresponding capture is popped off the a group.

What's particularly annoying about this is that this case would likely be a lot more common than case 2, since this is what happens if you try to use balancing groups the way they were meant to be used, but in a plain right-to-left regex.

Update on case 3: After some more testing done by Kobi it turns out that something happens on stack b. It appears that nothing is pushed, because m.Groups["b"].Success will be False and m.Groups["b"].Captures.Count will be 0. However, within the regex, the conditional (?(b)true|false) will now use the true branch. Also in .NET it seems to be possible to do (?<-b>) afterwards (after which accessing m.Groups["b"] will throw an exception), whereas Mono throws an exception immediately while matching the regex. Bug indeed.

like image 54
Martin Ender Avatar answered Oct 05 '22 12:10

Martin Ender