
Negative lookahead not working after character range with plus quantifier

I am trying to write a regex that matches every word (a run of letters) unless the word is followed by a :, in which case the whole word should be skipped. I decided to use a negative lookahead for this.

/([a-zA-Z]+)(?!:)/gm
string: lame:joker

Since I am using a character range, it seems to match one character at a time and only drops the last character before the :. How do I make it skip the entire match in this case?

Link to regex101: https://regex101.com/r/DlEmC9/1

shyam padia asked Jan 29 '23 12:01



1 Answer

The issue is related to backtracking: once [a-zA-Z]+ has consumed all the letters up to a :, the negative lookahead fails, so the engine gives back one character and re-checks the lookahead. At that position the next character is a letter rather than :, so the lookahead succeeds whenever there are at least two letters before the colon, and the match returned is the word minus its last letter. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
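The backtracking described above can be reproduced with a short sketch. It assumes Python's re module rather than the regex101 flavor used in the question, and the helper name find_words is hypothetical.

```python
import re

# Pattern from the question: one or more letters, not followed by a colon.
QUESTION_PATTERN = re.compile(r"[a-zA-Z]+(?!:)")

def find_words(text):
    """Return every non-overlapping match of QUESTION_PATTERN in text."""
    return QUESTION_PATTERN.findall(text)

print(find_words("lame:joker"))  # ['lam', 'joker']
print(find_words("c:real"))      # ['real']
print(find_words("real:c"))      # ['rea', 'c']
```

In lame:joker the engine gives back the final e, so lam is returned instead of the word being rejected.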

Adding implicit requirement to the negative lookahead

Since you only need to match a sequence of letters not followed by a colon, you can make explicit one more condition that is already implied: the match must not be followed by another letter either:

[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])

See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so [A-Za-z]+(?![A-Za-z:]) is the better choice.
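As a quick check, both variants can be run against the strings from the question. This is a minimal sketch assuming Python's re module; the names ALTERNATION, CHAR_CLASS and SAMPLES are illustrative only.

```python
import re

# The two equivalent forms of the extended negative lookahead.
ALTERNATION = re.compile(r"[A-Za-z]+(?![A-Za-z]|:)")
CHAR_CLASS = re.compile(r"[A-Za-z]+(?![A-Za-z:])")

SAMPLES = ["lame:joker", "c:real", "real:c"]

for text in SAMPLES:
    # Both patterns should agree on every sample.
    print(text, ALTERNATION.findall(text), CHAR_CLASS.findall(text))
# lame:joker ['joker'] ['joker']
# c:real ['real'] ['real']
# real:c ['c'] ['c']
```

Because a shorter match would now be followed by a letter, backtracking can no longer produce a partial word.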

Preventing backtracking into a word-like pattern by using a word boundary

As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: the meaning of a word boundary depends on context (see the number of ifs in the word boundary explanation).

[A-Za-z]+\b(?!:)

is a valid solution here, because the input implies that words are followed either by the end of the string or by non-word chars (i.e. chars other than letters, digits and underscores). See the regex demo.
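The word boundary variant can be tried the same way. This sketch assumes Python's re module, and the second test string is made up for illustration.

```python
import re

# \b pins the end of the match to the end of the word, so the engine
# cannot backtrack into the middle of it before the lookahead runs.
BOUNDARY_PATTERN = re.compile(r"[A-Za-z]+\b(?!:)")

print(BOUNDARY_PATTERN.findall("lame:joker"))    # ['joker']
print(BOUNDARY_PATTERN.findall("foo bar: baz"))  # ['foo', 'baz']
```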

When does a word boundary fail?

\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:

  • \d+\b(?!:) matches 12 in 12, but finds no match in 12:, and also none in 12c or 12_ (there is no word boundary between the digits and c or _)
  • \d+(?![\d:]) matches 12 in 12, and also in 12c and 12_; it fails only in 12:.
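The difference between the two number patterns can be seen side by side in a small sketch, assuming Python's re module; the names WITH_BOUNDARY, WITH_CLASS and INPUTS are illustrative.

```python
import re

WITH_BOUNDARY = re.compile(r"\d+\b(?!:)")   # word boundary variant
WITH_CLASS = re.compile(r"\d+(?![\d:])")    # extended lookahead variant

INPUTS = ["12,", "12:", "12c", "12_"]

for text in INPUTS:
    print(repr(text), WITH_BOUNDARY.findall(text), WITH_CLASS.findall(text))
# '12,' ['12'] ['12']
# '12:' [] []
# '12c' [] ['12']
# '12_' [] ['12']
```

The two patterns disagree only where the digits are glued to another word char, which is the case where \b is the wrong tool.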
Wiktor Stribiżew answered Feb 05 '23 17:02
