I'm trying to build a regex somewhat like this:
[match-word] ... [exclude-specific-word] ... [match-word]
This seems to work with a negative look-ahead, but I'm running into a problem when I have a case like this:
[match-word] ... [exclude-specific-word] ... [match-word] ... [excluded word appears again]
I want the above sentence to match, but the negative look-ahead between the first and the second matched word "spills over" so the second word is never matched.
Let's look at a practical example.
I want to match every sentence that has the word "i" and the word "pie", but not the word "hate" between those two words. I have these three sentences:
i sure like eating pie, but i love donuts <- Want to match this
i sure like eating pie, but i hate donuts <- Want to match this
i sure hate eating pie, but i like donuts <- Don't want to match this
I have this regex:
^i(?!.*hate).*pie
(I have removed the word boundaries for clarity; the original is ^i\b(?!.*\bhate\b).*\bpie\b)
This matches the first sentence but not the second one, because the .* inside the negative look-ahead lets it scan the rest of the string, past "pie".
Is there a way to limit the negative look-ahead, so that it's satisfied if it encounters "pie" before it encounters "hate"?
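To reproduce the problem outside JRegex, here is a minimal sketch in Python. It assumes Python's re module treats this look-ahead the same way the Java engines do, and it uses the original pattern with word boundaries:

```python
import re

# The pattern from the question, with its word boundaries restored.
pattern = re.compile(r"^i\b(?!.*\bhate\b).*\bpie\b")

sentences = [
    "i sure like eating pie, but i love donuts",
    "i sure like eating pie, but i hate donuts",
    "i sure hate eating pie, but i like donuts",
]

# The look-ahead runs once, right after the leading "i", and its .* lets it
# search the rest of the line, so a "hate" anywhere later rejects the sentence.
results = [bool(pattern.search(s)) for s in sentences]
for sentence, matched in zip(sentences, results):
    print(matched, sentence)
```

Only the first sentence matches; the second is rejected even though its "hate" comes after "pie".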
Note: in my implementation, there may be other terms following this regex (it's built dynamically from a grammar search engine), for instance:
^i(?!.*hate).*pie.*donuts
I'm currently using JRegex, but I could probably switch to JDK Regex if necessary.
Update: I forgot to mention something in my initial question:
The "negative" construct may appear later in the sentence, and I still want to match the sentence whenever a match is possible, even if the "negative" construct occurs further on.
To clarify, look at these sentences:
i sure like eating pie, but i love donuts <- Want to match this
i sure like eating pie, but i hate donuts <- Want to match this
i sure hate eating pie, but i like donuts <- Don't want to match this
i sure like eating pie, but i like donuts and i hate making pie <- Do want to match this
rob's answer works perfectly for this extra constraint, so I'm accepting that one.
At every character between your start and stop words, you have to make sure that neither your negative word nor your stop word begins there. Like this (with a little whitespace added for readability):
^i ( (?!hate|pie) . )* pie
Here's a Python program to test things.
import re
# Each pair is (sentence, whether it should match).
test = [('i sure like eating pie, but i love donuts', True),
        ('i sure like eating pie, but i hate donuts', True),
        ('i sure hate eating pie, but i like donuts', False)]
# re.X makes the regex engine ignore the whitespace inside the pattern.
rx = re.compile(r"^i ((?!hate|pie).)* pie", re.X)
for t, v in test:
    m = rx.match(t)
    print(t, "pass" if bool(m) == v else "fail")
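Since the question mentions that the regex is built dynamically and may be followed by further terms, the same idea can be wrapped in a small builder. This is only a sketch: the helper name between_without is hypothetical, and the word boundaries and the trailing .*donuts are taken from the question's examples.

```python
import re

def between_without(start, stop, excluded):
    """Hypothetical helper: start ... stop with no `excluded` word in between."""
    start, stop, excluded = (re.escape(w) for w in (start, stop, excluded))
    # Every character consumed between start and stop must not be the first
    # character of the excluded word or of the stop word.
    return rf"\b{start}\b(?:(?!\b(?:{excluded}|{stop})\b).)*\b{stop}\b"

# Further terms can simply be appended after the generated fragment.
rx = re.compile("^" + between_without("i", "pie", "hate") + ".*donuts")

sentences = [
    "i sure like eating pie, but i love donuts",
    "i sure like eating pie, but i hate donuts",
    "i sure hate eating pie, but i like donuts",
    "i sure like eating pie, but i like donuts and i hate making pie",
]
results = [bool(rx.search(s)) for s in sentences]
for sentence, matched in zip(sentences, results):
    print(matched, sentence)
```

The fourth sentence, from the question's update, still matches because the check stops at the first "pie" and never sees the later "hate".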
This regex should work for you
^(?!i.*hate.*pie)i.*pie.*donuts
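As a quick check of this pattern against the four sentences in the question (a sketch in Python, assuming its re module behaves like the Java engine for these constructs):

```python
import re

rx = re.compile(r"^(?!i.*hate.*pie)i.*pie.*donuts")

sentences = [
    "i sure like eating pie, but i love donuts",
    "i sure like eating pie, but i hate donuts",
    "i sure hate eating pie, but i like donuts",
    "i sure like eating pie, but i like donuts and i hate making pie",
]

# The look-ahead rejects a line whenever "hate" is followed by any later "pie".
results = [bool(rx.search(s)) for s in sentences]
for sentence, matched in zip(sentences, results):
    print(matched, sentence)
```

The first three sentences behave as requested. The fourth, from the question's update, is rejected because its trailing "hate making pie" satisfies the look-ahead.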
Explanation
"^" + // Assert position at the beginning of a line (at beginning of the string or after a line break character)
"(?!" + // Assert that it is impossible to match the regex below starting at this position (negative lookahead)
"i" + // Match the character “i” literally
"." + // Match any single character that is not a line break character
"*" + // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
"hate" + // Match the characters “hate” literally
"." + // Match any single character that is not a line break character
"*" + // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
"pie" + // Match the characters “pie” literally
")" +
"i" + // Match the character “i” literally
"." + // Match any single character that is not a line break character
"*" + // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
"pie" + // Match the characters “pie” literally
"." + // Match any single character that is not a line break character
"*" + // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
"donuts" // Match the characters “donuts” literally
To match ...A...B... with no C between A and B.
Test in Python:
$ python
>>> import re
>>> re.match(r'.*A(?!.*C.*B).*B', 'C A x B C')
<_sre.SRE_Match object at 0x94ab7c8>
So I get this regex:
.*\bi\b(?!.*hate.*pie).*pie