I'm trying to dig deeper into regexes and want to match a condition unless some substring is also found in the same string. I know I can use two grepl statements (as seen below), but I want to use a single regex to test for this condition, as I'm pushing my understanding. Let's say I want to match the words "dog" and "man" using "(dog.*man|man.*dog)" (taken from here) but not if the string contains the substring "park". I figured I could use (*SKIP)(*FAIL) to negate the "park", but this does not cause the string to fail (shown below).
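- How can I match the logic of find "dog" & "man" but not "park" with 1 regex?
- What is wrong with my understanding of (*SKIP)(*FAIL)|?

The code: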
x <- c(
"The dog and the man play in the park.",
"The man plays with the dog.",
"That is the man's hat.",
"Man I love that dog!",
"I'm dog tired",
"The dog park is no place for man.",
"Park next to this dog's man."
)
# Could do this but want one regex
grepl("(dog.*man|man.*dog)", x, ignore.case=TRUE) & !grepl("park", x, ignore.case=TRUE)
# Thought this would work, it does not
grepl("park(*SKIP)(*FAIL)|(dog.*man|man.*dog)", x, ignore.case=TRUE, perl=TRUE)
You can use the anchored look-ahead solution (requiring Perl-style regexp):
grepl("^(?!.*park)(?=.*dog.*man|.*man.*dog)", x, ignore.case=TRUE, perl=TRUE)
- ^ anchors the pattern at the start of the string
- (?!.*park) fails the match if park is present
- (?=.*dog.*man|.*man.*dog) fails the match unless both man and dog are present

Another version (more scalable) with 3 look-aheads:
^(?!.*park)(?=.*dog)(?=.*man)
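
As a quick check, both look-ahead patterns can be run against the question's vector x and compared with the two-grepl version. This is only a sketch: the names two_step, alt_form and three_la are made up for illustration and do not come from the answer.

```r
# Sample data from the question
x <- c(
  "The dog and the man play in the park.",
  "The man plays with the dog.",
  "That is the man's hat.",
  "Man I love that dog!",
  "I'm dog tired",
  "The dog park is no place for man.",
  "Park next to this dog's man."
)

# Reference result: two separate tests combined with &
two_step <- grepl("(dog.*man|man.*dog)", x, ignore.case = TRUE) &
  !grepl("park", x, ignore.case = TRUE)

# Single regex, alternation inside one positive look-ahead
alt_form <- grepl("^(?!.*park)(?=.*dog.*man|.*man.*dog)", x,
                  ignore.case = TRUE, perl = TRUE)

# Single regex, one positive look-ahead per required word
three_la <- grepl("^(?!.*park)(?=.*dog)(?=.*man)", x,
                  ignore.case = TRUE, perl = TRUE)

print(three_la)

# Both single-regex forms should agree with the two-step version
stopifnot(identical(two_step, alt_form), identical(two_step, three_la))
```

Only the second and fourth strings should come back TRUE. The three-look-ahead form is the easier one to extend: each extra required word is one more (?=.*word), and each extra forbidden word is one more (?!.*word).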
stribizhev has already answered this question as it should be approached: with a negative lookahead.
I'll contribute to this particular question:
What is wrong with my understanding of (*SKIP)(*FAIL)?
(*SKIP) and (*FAIL) are regex control verbs.

(*FAIL) or (*F)
(*FAIL) is exactly the same as a negative lookahead with an empty subpattern: (?!). As soon as the regex engine gets to that verb in the pattern, it forces an immediate backtrack.

(*SKIP)
When the regex engine first encounters this verb, nothing happens, because it only acts when it's reached on backtracking. But if there is a later failure, and it reaches (*SKIP) from right to left, the backtracking can't pass (*SKIP). It causes:
- the current match attempt to fail;
- the next attempt to start not at the next character, but at the position in the subject where the engine was when it reached (*SKIP).

That is why these two control verbs are usually together as (*SKIP)(*FAIL).
Let's consider the following example:
- Pattern: .*park(*SKIP)(*FAIL)|.*dog
- Subject: "That park has too many dogs"
- Matches: " has too many dog"
Internals:

That park has too many dogs  ||  .*park(*SKIP)(*FAIL)|.*dog
/\                               /\
(here) we have a match for park
       the engine passes (*SKIP) -no action
       it then encounters (*FAIL) -backtrack
       Now it reaches (*SKIP) from the right -FAIL!

(*SKIP) has this particular behaviour. The 2nd attempt starts:

That park has too many dogs  ||  .*park(*SKIP)(*FAIL)|.*dog
         /\                      /\
       (here)
Now, there's no match for .*park
And of course it matches .*dog

That park has too many dogs  ||  .*park(*SKIP)(*FAIL)|.*dog
         ^               ^                            -----
         |   (MATCH!)    |
         +---------------+
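
The walkthrough above can be reproduced in R. This is a minimal sketch, assuming R's perl = TRUE (PCRE) engine; the variable names subject, pattern, m, start and matched are illustrative only.

```r
subject <- "That park has too many dogs"
pattern <- ".*park(*SKIP)(*FAIL)|.*dog"

# regexpr() locates the first match; perl = TRUE is required for control verbs
m <- regexpr(pattern, subject, perl = TRUE)

# 1-based position where the reported match begins
start <- as.integer(m)

# Extract the matched text itself
matched <- regmatches(subject, m)

print(start)
print(matched)
```

The match should be " has too many dog", beginning right after "park": the first attempt was discarded at (*SKIP), and the second attempt resumed from that position instead of from the second character of the subject.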
How can I match the logic of find "dog" & "man" but not "park" with 1 regex?
Use stribizhev's solution! Try to avoid using control verbs for the sake of compatibility: they're not implemented in all regex flavours. But if you're interested in these regex oddities, there's another, stronger control verb: (*COMMIT). It is similar to (*SKIP), acting only on backtracking, except it causes the entire match to fail (there won't be any other attempt at all). For example:
+-----------------------------------------------+
|Pattern:                                       |
|^.*park(*COMMIT)(*FAIL)|dog                    |
+-------------------------------------+---------+
|Subject                              | Matches |
+-------------------------------------+---------+
|The dog and the man play in the park.|  FALSE  |
|Man I love that dog!                 |  TRUE   |
|I'm dog tired                        |  TRUE   |
|The dog park is no place for man.    |  FALSE  |
|park next to this dog's man.         |  FALSE  |
+-------------------------------------+---------+
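
The table can be checked in R with a short sketch like the one below, again assuming the perl = TRUE (PCRE) engine. The names subjects and commit_hits are hypothetical, and the match is case-sensitive here, as in the table.

```r
# Subjects from the table above
subjects <- c(
  "The dog and the man play in the park.",
  "Man I love that dog!",
  "I'm dog tired",
  "The dog park is no place for man.",
  "park next to this dog's man."
)

# If "park" is found, (*FAIL) backtracks onto (*COMMIT) and the whole match
# is abandoned; otherwise the second alternative looks for "dog"
commit_hits <- grepl("^.*park(*COMMIT)(*FAIL)|dog", subjects, perl = TRUE)

print(data.frame(subject = subjects, matches = commit_hits))
```

Note that this pattern only tests for "dog" without "park"; it is meant to show how (*COMMIT) behaves, not to replace the look-ahead solution for the original dog/man/park question.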