I'm having trouble understanding the finer details of negative lookahead regular expressions. After reading Regex lookahead, lookbehind and atomic groups, I thought I had a good summary of negative lookaheads when I found this description:
(?!REGEX_1)REGEX_2
Match only if
REGEX_1
does not match; after checkingREGEX_1
, the search forREGEX_2
starts at the same position.
Hoping I understood the algorithm, I cooked up a two-sentence test insult; I wanted to find the sentence without a certain word. Specifically...
Insult: 'Yomama is ugly. And, she smells like a wet dog.'
Requirements:
- Test 1: Return a sentence without 'ugly'.
- Test 2: Return a sentence without 'looks'.
- Test 3: Return a sentence without 'smells'.
I assigned the test words to $arg
, and I used (?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.)
to implement the test.
(?![A-Z].*?$arg.*?\.)
is a negative lookahead to reject a sentence with the test word([A-Z].*?\.)
matches at least one sentence.The critical piece seems to be in understanding where the regex engine starts matching after processing the negative lookahead.
Expected Results:
- Test 1 ($arg = "ugly"): "And, she smells like a wet dog."
- Test 2 ($arg = "looks"): "Yomama is ugly."
- Test 3 ($arg = "smells"): "Yomama is ugly."
Actual Results:
- Test 1 ($arg = "ugly"): "And, she smells like a wet dog." (Success)
- Test 2 ($arg = "looks"): "Yomama is ugly." (Success)
- Test 3 ($arg = "smells"): Failed, no match
At first I thought Test 3 failed because ([A-Z].*?\.)
was too greedy and matched both sentences; however, (?:(?![A-Z].*?$arg.*?\.))([A-Z][^\.]*?\.)
didn't work either. Next I wondered whether there was a problem with the python negative lookahead implementation, but perl gave me exactly the same result.
Finally I found the solution, I had to reject periods in my .*?
portion of the expressions by using [^\.]*?
; so this regex works: (?:(?![A-Z][^\.]*?$arg[^\.]*?\.))([A-Z][^\.]*?\.)
However, I have another concern; "Yomama is ugly." does not have "smells" in it. So, if .*?
is supposed to be a non-greedy match, why can't I complete Test 3 with (?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.)
?
In light of @bvr's excellent suggestion to use -Mre=debug
, I will consider this some more after work. It certainly looks like Seth's description is accurate at this point. What I learned so far is that negative lookahead expressions will match whenever possible, even if I put non-greedy .*?
operators in the NLA.
import re
def test_re(arg, INSULTSTR):
mm = re.search(r'''
(?: # No grouping
(?![A-Z].*?%s.*?\.)) # Negative zero-width
# assertion: arg, followed by a period
([A-Z].*?\.) # Match a capital letter followed by a period
''' % arg, INSULTSTR, re.VERBOSE)
if mm is not None:
print "neg-lookahead(%s) MATCHED: '%s'" % (arg, mm.group(1))
else:
print "Unable to match: neg-lookahead(%s) in '%s'" % (arg, INSULTSTR)
INSULT = 'Yomama is ugly. And, she smells like a wet dog.'
test_re('ugly', INSULT)
test_re('looks', INSULT)
test_re('smells', INSULT)
#!/usr/bin/perl
sub test_re {
$arg = $_[0];
$INSULTSTR = $_[1];
$INSULTSTR =~ /(?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.)/;
if ($1) {
print "neg-lookahead($arg) MATCHED: '$1'\n";
} else {
print "Unable to match: neg-lookahead($arg) in '$INSULTSTR'\n";
}
}
$INSULT = 'Yomama is ugly. And, she smells like a wet dog.';
test_re('ugly', $INSULT);
test_re('looks', $INSULT);
test_re('smells', $INSULT);
neg-lookahead(ugly) MATCHED: 'And, she smells like a wet dog.'
neg-lookahead(looks) MATCHED: 'Yomama is ugly.'
Unable to match: neg-lookahead(smells) in 'Yomama is ugly. And, she smells like a wet dog.'
Because the lookahead is negative, this means that the lookahead has successfully matched at the current position. At this point, the entire regex has matched, and q is returned as the match.
Lookbehind, which is used to match a phrase that is preceded by a user specified text. Positive lookbehind is syntaxed like (? <=a)something which can be used along with any regex parameter. The above phrase matches any "something" word that is preceded by an "a" word.
Zero-Width Matches As we've seen, a lookaround looks left or right but it doesn't add any characters to the match to be returned by the regex engine. Likewise, an anchor such as ^ and a boundary such as \b can match at a given position in the string, but they do not add any characters to the match.
Lookahead assertions are part of JavaScript's original regular expression support and are thus supported in all browsers.
#!/usr/bin/perl
sub test_re {
$arg = $_[0];
$INSULTSTR = $_[1];
$INSULTSTR =~ /(?:^|\.\s*)(?:(?![^.]*?$arg[^.]*\.))([^.]*\.)/;
if ($1) {
print "neg-lookahead($arg) MATCHED: '$1'\n";
} else {
print "Unable to match: neg-lookahead($arg) in '$INSULTSTR'\n";
}
}
$INSULT = 'Yomama is ugly. And, she smells like an wet dog.';
test_re('Yomama', $INSULT);
test_re('ugly', $INSULT);
test_re('looks', $INSULT);
test_re('And', $INSULT);
test_re('And,', $INSULT);
test_re('smells', $INSULT);
test_re('dog', $INSULT);
Results:
neg-lookahead(Yomama) MATCHED: 'And, she smells like an wet dog.'
neg-lookahead(ugly) MATCHED: 'And, she smells like an wet dog.'
neg-lookahead(looks) MATCHED: 'Yomama is ugly.'
neg-lookahead(And) MATCHED: 'Yomama is ugly.'
neg-lookahead(And,) MATCHED: 'Yomama is ugly.'
neg-lookahead(smells) MATCHED: 'Yomama is ugly.'
neg-lookahead(dog) MATCHED: 'Yomama is ugly.'
If you're curious about what Perl is doing with a regex, you can run with the regex debugger:
perl -Dr -e '"A two. A one." =~ /(?![A-Z][^\.]*(?:two)[^\.]*\.)([A-Z][^\.]+\.)/; print ">$1<\n"'
which will generate much output for you to ponder. You will need a Perl built with -DDEBUGGING.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With