
Regex: negative look-ahead between two matches

I'm trying to build a regex somewhat like this:

[match-word] ... [exclude-specific-word] ... [match-word]

This seems to work with a negative look-ahead, but I'm running into a problem when I have a case like this:

[match-word] ... [exclude-specific-word] ... [match-word] ... [excluded word appears again]

I want the above sentence to match, but the negative look-ahead between the first and the second matched word "spills over" so the second word is never matched.

Let's look at a practical example.

I want to match every sentence which has the word "i" and the word "pie", but not the word "hate" in between those two words. I have these three sentences:

i sure like eating pie, but i love donuts <- Want to match this
i sure like eating pie, but i hate donuts <- Want to match this
i sure hate eating pie, but i like donuts <- Don't want to match this

I have this regex:

^i(?!.*hate).*pie

(I have removed the word boundaries for clarity; the original is ^i\b(?!.*\bhate\b).*\bpie\b)

Which matches the first sentence, but not the second one, because the negative look-ahead scans the whole string.
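The behaviour can be reproduced with a few lines of Python. This is only a sketch using the word-boundary version of the regex quoted above; the variable names are illustrative, and Python's `re` is assumed to treat this look-ahead the same way JRegex does.

```python
import re

sentences = [
    "i sure like eating pie, but i love donuts",
    "i sure like eating pie, but i hate donuts",
    "i sure hate eating pie, but i like donuts",
]

# The look-ahead (?!.*\bhate\b) is evaluated once, right after "i",
# and its .* is free to scan to the end of the string.
unbounded = re.compile(r"^i\b(?!.*\bhate\b).*\bpie\b")

unbounded_results = [bool(unbounded.match(s)) for s in sentences]
print(unbounded_results)  # [True, False, False]
```

The second sentence is rejected only because "hate" appears somewhere after "i", even though it comes after "pie".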

Is there a way to limit the negative look-ahead, so that it's satisfied if it encounters "pie" before it encounters "hate"?

Note: in my implementation, there may be other terms following this regex (it's built dynamically from a grammar search engine), for instance:

^i(?!.*hate).*pie.*donuts

I'm currently using JRegex, but could probably switch to JDK Regex if necessary.

Update: I forgot to mention something in my initial question:

It's possible that the "negative construct" appears later in the sentence, and I do want the sentence to match whenever a valid match exists, even if the "negative" construct appears further on.

To clarify, look at these sentences:

i sure like eating pie, but i love donuts <- Want to match this
i sure like eating pie, but i hate donuts <- Want to match this
i sure hate eating pie, but i like donuts <- Don't want to match this
i sure like eating pie, but i like donuts and i hate making pie <- Do want to match this

rob's answer works perfectly for this extra constraint, so I'm accepting that one.

Alexander Malfait asked Mar 23 '12 17:03


3 Answers

At every character between your start and stop words, you have to check that neither your negative word nor your stop word begins at that position. Like this (I've added a little whitespace for readability):

^i ( (?!hate|pie) . )* pie

Here's a Python program to test things.

import re

# (sentence, should the regex match?)
tests = [('i sure like eating pie, but i love donuts', True),
         ('i sure like eating pie, but i hate donuts', True),
         ('i sure hate eating pie, but i like donuts', False)]

# re.X (verbose mode) ignores the spaces inside the pattern
rx = re.compile(r"^i ((?!hate|pie).)* pie", re.X)

for sentence, expected in tests:
    m = rx.match(sentence)
    print(sentence, "pass" if bool(m) == expected else "fail")
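Since the question mentions word boundaries and further terms following the pattern, here is a rough sketch of how the same idea might be generated for arbitrary words. The helper name `between_without` and its parameters are made up for illustration and are not part of any library; the fourth sentence is the one from the question's update.

```python
import re

def between_without(start, stop, banned, tail=""):
    """Hypothetical helper: `start` ... `stop` with no `banned` word in between.

    `tail` is an optional raw regex fragment appended after the stop word.
    """
    s, e, b = (re.escape(w) for w in (start, stop, banned))
    # Each "." is guarded: it may not step onto the banned word or the
    # stop word, so the guarded scan cannot run past the first stop word.
    return re.compile(rf"^{s}\b(?:(?!\b(?:{b}|{e})\b).)*\b{e}\b{tail}")

sentences = [
    "i sure like eating pie, but i love donuts",
    "i sure like eating pie, but i hate donuts",
    "i sure hate eating pie, but i like donuts",
    "i sure like eating pie, but i like donuts and i hate making pie",
]

rx = between_without("i", "pie", "hate", tail=r".*\bdonuts\b")
tempered_results = [bool(rx.match(s)) for s in sentences]
print(tempered_results)  # [True, True, False, True]
```

Because the check is repeated at every character and stops at the first "pie", a "hate" that appears later in the sentence no longer affects the result.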
rob answered Oct 20 '22 01:10


This regex should work for you:

^(?!i.*hate.*pie)i.*pie.*donuts

Explanation

"^" +          // Assert position at the beginning of a line (at beginning of the string or after a line break character)
"(?!" +        // Assert that it is impossible to match the regex below starting at this position (negative lookahead)
   "i" +          // Match the character “i” literally
   "." +          // Match any single character that is not a line break character
      "*" +          // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   "hate" +       // Match the characters “hate” literally
   "." +          // Match any single character that is not a line break character
      "*" +          // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   "pie" +        // Match the characters “pie” literally
")" +
"i" +          // Match the character “i” literally
"." +          // Match any single character that is not a line break character
   "*" +          // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
"pie" +        // Match the characters “pie” literally
"." +          // Match any single character that is not a line break character
   "*" +          // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
"donuts"       // Match the characters “donuts” literally
Narendra Yadala answered Oct 20 '22 02:10


To match ...A...B... with no C between A and B:

Test in python:

$ python
>>> import re
>>> re.match(r'.*A(?!.*C.*B).*B', 'C A x B C')
<_sre.SRE_Match object at 0x94ab7c8>

So I get this regex:

.*\bi\b(?!.*hate.*pie).*pie
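A few more inputs for the abstract A/B/C pattern may help show where its limits are. This is only a sketch; the sample strings are made up for illustration.

```python
import re

abc = re.compile(r".*A(?!.*C.*B).*B")

samples = [
    "C A x B C",  # C only before A and after the last B
    "A C B",      # C sits between A and B
    "A x B C B",  # a clean A...B exists, but a later C is followed by another B
]

abc_results = [bool(abc.match(s)) for s in samples]
print(abc_results)  # [True, False, False]
```

The last sample is rejected even though the first B is reached without crossing a C, because the look-ahead considers every later B. The same applies to a sentence where "hate ... pie" appears again after the first "pie".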
kev answered Oct 20 '22 01:10