Negative lookbehind in regex

(Note: not a duplicate of Why can't you use repetition quantifiers in zero-width look behind assertions; see end of post.)

I'm trying to write a grep -P (Perl) regex that matches B, when it is not preceded by A -- regardless of whether there is intervening whitespace.

So, I tried this negative lookbehind, and tested it in regex101.com:

(?<!A)\s*B

This causes "AB" not to be matched, which is good, but "A B" does result in a match, which is not what I want.

I am not exactly sure why this is. It seems to have something to do with the fact that \s* can match the empty string "", so in a sense there are infinitely many matches of \s* between A and B. But why does this affect "A B" but not "AB"?

Is the following regex a proper solution, and if so, why exactly does it fix the problem?

(?<![A\s])\s*B

I posted this before and it was incorrectly marked as a duplicate question. The variable-length thing I'm looking for is part of the match, not part of the negative lookbehind itself, so this is quite different from the other question. Yes, I could put the \s* inside the negative lookbehind, but I haven't done so (and doing so is not supported, as the other question explains). Also, I am particularly interested in why the alternate regex I posted above works, since I know it works but I'm not exactly sure why. The other question did not help answer that.

asked Mar 29 '17 21:03 by std_answ



1 Answer

But why does this affect "A B" but not "AB"?

A regex is tried at each position in the string, and it helps to think of a position as lying between characters. In "A B" there is a position (after the space and before the B) where (?<!A) succeeds (because the character immediately preceding is a space, not an A), and \s*B succeeds (\s* matches the empty string, and B matches B), so the entire pattern succeeds.

In "AB" there is no such position. The only place where \s*B can match (immediately before the B) is also immediately after the A, so (?<!A) cannot succeed there. No position satisfies both conditions, so the pattern as a whole can't succeed.
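The position-by-position reasoning above can be checked with a small sketch. It uses Python's re module rather than grep -P, on the assumption that both engines treat this particular pattern the same way; the helper name match_span is made up for illustration.

```python
import re

# The pattern from the question: "not immediately preceded by A",
# then optional whitespace, then B.
PATTERN = re.compile(r"(?<!A)\s*B")

def match_span(text):
    """Return the (start, end) span of the first match, or None."""
    m = PATTERN.search(text)
    return m.span() if m else None

for sample in ["AB", "A B", "B"]:
    print(repr(sample), match_span(sample))
# 'AB' None
# 'A B' (2, 3)
# 'B' (0, 1)
```

The span (2, 3) for "A B" shows that the match starts after the space: \s* matched the empty string, and the lookbehind only inspected the space.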

Is the following regex a proper solution, and if so, why exactly does it fix the problem?

(?<![A\s])\s*B

This works because (?<![A\s]) will not succeed immediately after an A or after a whitespace character. So now the lookbehind forbids any match position that has whitespace directly before it. If there is any whitespace before the B, it has to be consumed by the \s* portion of the pattern, and the match position must be before that whitespace. If that position also doesn't have an A before it, the lookbehind can succeed and the pattern as a whole can match.

This is a trick that's made possible by the fact that \s is a fixed-width pattern that matches at every position inside of a non-empty \s* match. It can't be extended to the general case of any pattern between the (non-)A and the B.
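As a rough check of the (?<![A\s])\s*B behaviour described above, here is a sketch in Python's re module, again assuming it agrees with grep -P for this pattern. The helper name matched_text and the sample strings are illustrative only.

```python
import re

# Lookbehind now rejects positions preceded by A or by whitespace,
# so any whitespace before B must be consumed by \s* itself.
FIXED = re.compile(r"(?<![A\s])\s*B")

def matched_text(text):
    """Return the text of the first match, or None if there is none."""
    m = FIXED.search(text)
    return m.group(0) if m else None

for sample in ["AB", "A B", "A   B", "C B", " B"]:
    print(repr(sample), repr(matched_text(sample)))
# 'AB' None
# 'A B' None
# 'A   B' None
# 'C B' ' B'
# ' B' ' B'
```

In the matching cases the whitespace is part of the match text, which reflects the point that the match position has to sit before the whole whitespace run.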

answered Oct 02 '22 04:10 by hobbs