Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Negative Lookahead Regex greed (why is .*? too greedy)

Tags:

I'm having trouble understanding the finer details of negative lookahead regular expressions. After reading Regex lookahead, lookbehind and atomic groups, I thought I had a good summary of negative lookaheads when I found this description:

(?!REGEX_1)REGEX_2

Match only if REGEX_1 does not match; after checking REGEX_1, the search for REGEX_2 starts at the same position.

Hoping I understood the algorithm, I cooked up a two-sentence test insult; I wanted to find the sentence without a certain word. Specifically...

Insult: 'Yomama is ugly. And, she smells like a wet dog.'

Requirements:

  • Test 1: Return a sentence without 'ugly'.
  • Test 2: Return a sentence without 'looks'.
  • Test 3: Return a sentence without 'smells'.

I assigned the test words to $arg, and I used (?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.) to implement the test.

  • (?![A-Z].*?$arg.*?\.) is a negative lookahead to reject a sentence with the test word
  • ([A-Z].*?\.) matches at least one sentence.

The critical piece seems to be in understanding where the regex engine starts matching after processing the negative lookahead.

Expected Results:

  • Test 1 ($arg = "ugly"): "And, she smells like a wet dog."
  • Test 2 ($arg = "looks"): "Yomama is ugly."
  • Test 3 ($arg = "smells"): "Yomama is ugly."

Actual Results:

  • Test 1 ($arg = "ugly"): "And, she smells like a wet dog." (Success)
  • Test 2 ($arg = "looks"): "Yomama is ugly." (Success)
  • Test 3 ($arg = "smells"): Failed, no match

At first I thought Test 3 failed because ([A-Z].*?\.) was too greedy and matched both sentences; however, (?:(?![A-Z].*?$arg.*?\.))([A-Z][^\.]*?\.) didn't work either. Next I wondered whether there was a problem with the python negative lookahead implementation, but perl gave me exactly the same result.

Finally I found the solution, I had to reject periods in my .*? portion of the expressions by using [^\.]*?; so this regex works: (?:(?![A-Z][^\.]*?$arg[^\.]*?\.))([A-Z][^\.]*?\.)

Question

However, I have another concern; "Yomama is ugly." does not have "smells" in it. So, if .*? is supposed to be a non-greedy match, why can't I complete Test 3 with (?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.)?

EDIT

In light of @bvr's excellent suggestion to use -Mre=debug, I will consider this some more after work. It certainly looks like Seth's description is accurate at this point. What I learned so far is that negative lookahead expressions will match whenever possible, even if I put non-greedy .*? operators in the NLA.


Python Implementation

import re

def test_re(arg, INSULTSTR):
    mm = re.search(r'''
        (?:                  # No grouping
        (?![A-Z].*?%s.*?\.)) # Negative zero-width
                             #     assertion: arg, followed by a period
        ([A-Z].*?\.)         # Match a capital letter followed by a period
        ''' % arg, INSULTSTR, re.VERBOSE)
    if mm is not None:
        print "neg-lookahead(%s) MATCHED: '%s'" % (arg, mm.group(1))
    else:
        print "Unable to match: neg-lookahead(%s) in '%s'" % (arg, INSULTSTR)


INSULT = 'Yomama is ugly.  And, she smells like a wet dog.'
test_re('ugly', INSULT)
test_re('looks', INSULT)
test_re('smells', INSULT)

Perl Implementation

#!/usr/bin/perl

sub test_re {
    $arg    = $_[0];
    $INSULTSTR = $_[1];
    $INSULTSTR =~ /(?:(?![A-Z].*?$arg.*?\.))([A-Z].*?\.)/;
    if ($1) {
        print "neg-lookahead($arg) MATCHED: '$1'\n";
    } else {
        print "Unable to match: neg-lookahead($arg) in '$INSULTSTR'\n";
    }
}

$INSULT = 'Yomama is ugly.  And, she smells like a wet dog.';
test_re('ugly', $INSULT);
test_re('looks', $INSULT);
test_re('smells', $INSULT);

Output

neg-lookahead(ugly) MATCHED: 'And, she smells like a wet dog.'
neg-lookahead(looks) MATCHED: 'Yomama is ugly.'
Unable to match: neg-lookahead(smells) in 'Yomama is ugly.  And, she smells like a wet dog.'
like image 441
Mike Pennington Avatar asked May 25 '11 03:05

Mike Pennington


People also ask

What is negative lookahead in regex?

Because the lookahead is negative, this means that the lookahead has successfully matched at the current position. At this point, the entire regex has matched, and q is returned as the match.

What is Lookbehind in regex?

Lookbehind, which is used to match a phrase that is preceded by a user specified text. Positive lookbehind is syntaxed like (? <=a)something which can be used along with any regex parameter. The above phrase matches any "something" word that is preceded by an "a" word.

What are Lookarounds in regex?

Zero-Width Matches As we've seen, a lookaround looks left or right but it doesn't add any characters to the match to be returned by the regex engine. Likewise, an anchor such as ^ and a boundary such as \b can match at a given position in the string, but they do not add any characters to the match.

Can I use regex lookahead?

Lookahead assertions are part of JavaScript's original regular expression support and are thus supported in all browsers.


2 Answers

#!/usr/bin/perl

sub test_re {
    $arg    = $_[0];
    $INSULTSTR = $_[1];
    $INSULTSTR =~ /(?:^|\.\s*)(?:(?![^.]*?$arg[^.]*\.))([^.]*\.)/;
    if ($1) {
        print "neg-lookahead($arg) MATCHED: '$1'\n";
    } else {
        print "Unable to match: neg-lookahead($arg) in '$INSULTSTR'\n";
    }
}

$INSULT = 'Yomama is ugly.  And, she smells like an wet dog.';
test_re('Yomama', $INSULT);
test_re('ugly', $INSULT);
test_re('looks', $INSULT);
test_re('And', $INSULT);
test_re('And,', $INSULT);
test_re('smells', $INSULT);
test_re('dog', $INSULT);

Results:

neg-lookahead(Yomama) MATCHED: 'And, she smells like an wet dog.'
neg-lookahead(ugly) MATCHED: 'And, she smells like an wet dog.'
neg-lookahead(looks) MATCHED: 'Yomama is ugly.'
neg-lookahead(And) MATCHED: 'Yomama is ugly.'
neg-lookahead(And,) MATCHED: 'Yomama is ugly.'
neg-lookahead(smells) MATCHED: 'Yomama is ugly.'
neg-lookahead(dog) MATCHED: 'Yomama is ugly.'
like image 97
Seth Robertson Avatar answered Oct 09 '22 21:10

Seth Robertson


If you're curious about what Perl is doing with a regex, you can run with the regex debugger:

perl -Dr -e '"A two. A one." =~ /(?![A-Z][^\.]*(?:two)[^\.]*\.)([A-Z][^\.]+\.)/; print ">$1<\n"'

which will generate much output for you to ponder. You will need a Perl built with -DDEBUGGING.

like image 35
Alex Avatar answered Oct 09 '22 20:10

Alex