Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex lookahead ordering

Tags:

regex

I'm pretty decent with regular expressions, and now I'm trying once again to understand lookahead and lookbehind assertions. They mostly make sense, but I'm not quite sure how the order affects the result. I've been looking at this site which places lookbehinds before the expression, and lookaheads after the expression. My question is, does this change anything? A recent answer here on SO placed the lookahead before the expression which is leading to my confusion.

like image 334
David Kanarek Avatar asked Jan 24 '10 06:01

David Kanarek


3 Answers

When tutorials introduce lookarounds, they tend to choose the simplest use case for each one. So they'll use examples like (?<!a)b ('b' not preceded by 'a') or q(?=u) ('q' followed by 'u'). It's just to avoid cluttering the explanation with distracting details, but it tends to create (or reinforce) the impression that lookbehinds and lookaheads are supposed to appear in a certain order. It took me quite a while to get over that idea, and I've seen several others afflicted with it, too.

Try looking at some more realistic examples. One question that comes up a lot involves validating passwords; for example, making sure a new password is at least six characters long and contains at least one letter and one digit. One way to do that would be:

^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]{6,}$

The character class [A-Za-z0-9]{6,} could match all letters or all digits, so you use the lookaheads to ensure that there's at least one of each. In this case, you have to do the lookaheads first, because the later parts of the regex have to be able to examine the whole string.

For another example, suppose you need to find all occurrences of the word "there" unless it's preceded by a quotation mark. The obvious regex for that is (?<!")[Tt]here\b, but if you're searching a large corpus, that could create a performance problem. As written, that regex will do the negative lookbehind at each and every position in the text, and only when that succeeds will it check the rest of the regex.

Every regex engine has its own strengths and weaknesses, but one thing that's true of all of them is that they're quicker to find fixed sequences of literal characters than anything else--the longer the sequence, the better. That means it can be dramatically faster to do the lookbehind last, even though it means matching the word twice:

[Tt]here\b(?<!"[Tt]here)

So the rule governing the placement of lookarounds is that there is no rule; you put them wherever they make the most sense in each case.

like image 142
Alan Moore Avatar answered Nov 01 '22 22:11

Alan Moore


It's easier to show in an example than explain, I think. Let's take this regex:

(?<=\d)(?=(.)\1)(?!p)\w(?<!q)

What this means is:

  1. (?<=\d) - make sure what comes before the match position is a digit.
  2. (?=(.)\1) - make sure whatever character we match at this (same) position is followed by a copy of itself (through the backreference).
  3. (?!p) - make sure what follows is not a p.
  4. \w - match a letter, digit or underscore. Note that this is the first time we actually match and consume the character.
  5. (?<!q) - make sure what we matched so far doesn't end with a q.

All this will match strings like abc5ddx or 9xx but not 5d or 6qq or asd6pp or add. Note that each assertion works independently. It just stops, looks around, and if all is well, allows the matching to continue.

Note also that in most (probably all) implementations, lookbehinds have the limitation of being fixed-length. You can't use repetition/optionality operators like ?, *, and + in them. This is because to match a pattern we need a starting point - otherwise we'd have to try matching each lookbehind from every point in the string.

A sample run of this regex on the string a3b5ddx is as follows:

  1. Text cursor position: 0.
    1. Try to match the first lookbehind at position -1 (since \d always matches 1 character). We can't match at negative indices, so fail and advance the cursor.
  2. Text cursor position: 1.
    1. Try to match the first lookbehind at position 0. a does not match \d so fail and advance the cursor again.
  3. Text cursor position: 2.
    1. Try to match the first lookbehind at position 1. 3 does match \d so keep the cursor intact and continue matching.
    2. Try to match the first lookahead at position 2. b matches (.) and is captured. 5 does not match \1 (which is the captured b). Therefore, fail and advance the cursor.
  4. Text cursor position: 3.
    1. Try to match the first lookbehind at position 2. b does not match \d so fail and advance the cursor again.
  5. Text cursor position: 4.
    1. Try to match the first lookbehind at position 3. 5 does match \d so keep the cursor intact and continue matching.
    2. Try to match the first lookahead at position 4. d matches (.) and is captured. The second d does match \1 (which is the first captured d). Allow the matching to continue from where we left off.
    3. Try to match the second lookahead. b at position 4 does not match p, and since this is a negative lookahead, that's what we want; allow the matching to continue.
    4. Try to match \w at position 4. b matches. Advance cursor since we have consumed a character and continue. Also mark this as the start of the match.
  6. Text cursor position: 5.
    1. Try to match the second lookbehind at position 4 (since q always matches 1 character). d does not match q which is what we want from a negative lookbehind.
    2. Realize that we're at the end of the regex and report success by returning the substring from the start of the match to the current position (4 to 5), which is d.
like image 23
Max Shawabkeh Avatar answered Nov 01 '22 21:11

Max Shawabkeh


1(?=ABC) means - look for 1, and match (but don't capture) ABC after it.
(?<=ABC)1 means - match (but don't capture) ABC before the current location, and continue to match 1.
So, normally, you'll place the lookahead after the expression and the lookbehind before it.

When we place a lookbehind after the expression, we're rechecking the string we've already matched. This is common when you have complex conditions (you can think about it as the AND of regexs). For example, take a look on this recent answer by Daniel Brückner:

.&.(?<! & )

First, you capture an ampersand between two characters. Next, you check they were both not spaces (\S&\S would not work here, the OP wanted to capture 1&_).

like image 37
Kobi Avatar answered Nov 01 '22 21:11

Kobi