regex match substring unless another substring matches

Tags: regex, r

I'm trying to dig deeper into regexes and want to match a condition unless some other substring is also found in the same string. I know I can use two grepl statements (as seen below), but I want to use a single regex to test for this condition, as I'm pushing my understanding. Let's say I want to match the words "dog" and "man" using "(dog.*man|man.*dog)", but not if the string contains the substring "park". I figured I could use (*SKIP)(*FAIL) to negate the "park", but this does not cause the string to fail (shown below).

  • How can I match the logic of find "dog" & "man" but not "park" with 1 regex?
  • What is wrong with my understanding of (*SKIP)(*FAIL)|?

The code:

x <- c(
    "The dog and the man play in the park.",
    "The man plays with the dog.",
    "That is the man's hat.",
    "Man I love that dog!",
    "I'm dog tired",
    "The dog park is no place for man.",
    "Park next to this dog's man."
)

# Could do this but want one regex
grepl("(dog.*man|man.*dog)", x, ignore.case=TRUE) & !grepl("park", x, ignore.case=TRUE)

# Thought this would work, it does not
grepl("park(*SKIP)(*FAIL)|(dog.*man|man.*dog)", x, ignore.case=TRUE, perl=TRUE)

Tyler Rinker asked Sep 23 '15 19:09


2 Answers

You can use the anchored look-ahead solution (requiring Perl-style regexp):

grepl("^(?!.*park)(?=.*dog.*man|.*man.*dog)", x, ignore.case=TRUE, perl=TRUE)

  • ^ - anchors the pattern at the start of the string
  • (?!.*park) - fail the match if park is present
  • (?=.*dog.*man|.*man.*dog) - fail the match unless both dog and man are present (in either order).

Another version (more scalable) with 3 look-aheads:

^(?!.*park)(?=.*dog)(?=.*man)
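
As a quick check, both look-ahead patterns can be run against the vector x from the question and compared with the two-grepl baseline. This is only a sketch; the names baseline, pat_two, pat_three, res_two and res_three are illustrative and not part of the original answer.

```r
x <- c(
    "The dog and the man play in the park.",
    "The man plays with the dog.",
    "That is the man's hat.",
    "Man I love that dog!",
    "I'm dog tired",
    "The dog park is no place for man.",
    "Park next to this dog's man."
)

# Two-step baseline from the question: both words present, "park" absent
baseline <- grepl("(dog.*man|man.*dog)", x, ignore.case = TRUE) &
    !grepl("park", x, ignore.case = TRUE)

# The two single-regex variants (illustrative variable names)
pat_two   <- "^(?!.*park)(?=.*dog.*man|.*man.*dog)"
pat_three <- "^(?!.*park)(?=.*dog)(?=.*man)"

res_two   <- grepl(pat_two,   x, ignore.case = TRUE, perl = TRUE)
res_three <- grepl(pat_three, x, ignore.case = TRUE, perl = TRUE)

print(res_two)
# [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE

# Both variants should agree with the baseline
print(identical(res_two, baseline))
print(identical(res_three, baseline))
```

The three-look-ahead form scales more easily because each additional required word is one more (?=.*word), rather than another ordering in the alternation.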

Wiktor Stribiżew answered Oct 13 '22 21:10


stribizhev has already answered this question as it should be approached: with a negative lookahead.

I'll contribute to this particular question:

What is wrong with my understanding of (*SKIP)(*FAIL)?

(*SKIP) and (*FAIL) are regex control verbs.

  1. (*FAIL) or (*F)
    This is the easiest to understand. (*FAIL) is exactly the same as a negative lookahead with an empty subpattern: (?!). As soon as the regex engine gets to that verb in the pattern it forces an immediate backtrack.
  2. (*SKIP)
    When the regex engine first encounters this verb, nothing happens, because it only acts when it's reached on backtracking. But if there is a later failure and the engine reaches (*SKIP) from right to left, the backtracking can't pass (*SKIP). It causes:

    • A failure of the current match attempt.
    • The next match won't be attempted from the next character. Instead, it will start from the position in the text where the engine was when it reached (*SKIP).

    That is why these two control verbs are usually used together as (*SKIP)(*FAIL): (*FAIL) forces the backtrack that triggers (*SKIP).

Let's consider the following example:

  • Pattern: .*park(*SKIP)(*FAIL)|.*dog
  • Subject: "That park has too many dogs"
  • Matches: " has too many dog"

Internals:

  1. First attempt.
    That park has too many dogs              ||  .*park(*SKIP)(*FAIL)|.*dog
            /\                                        /\
          (here) we have a match for park
                 the engine passes (*SKIP) -no action
                 it then encounters (*FAIL) -backtrack
                 Now it reaches (*SKIP) from the right -FAIL!
  2. Second attempt.
    Normally, it should start from the second character in the subject. However, (*SKIP) has this particular behaviour. The 2nd attempt starts:
    That park has too many dogs              ||  .*park(*SKIP)(*FAIL)|.*dog
            /\                                                       /\
          (here)
          Now, there's no match for .*park
          And of course it matches .*dog

    That park has too many dogs              ||  .*park(*SKIP)(*FAIL)|.*dog
             ^               ^                                        -----
             |    (MATCH!)   |
             +---------------+
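
The walkthrough above can be reproduced in R with regexpr and regmatches (a minimal sketch; the variable names s, m, q, pat and hit are illustrative). The second half applies the same reasoning to the pattern from the question: the engine tries start positions from left to right, and "dog.*man" succeeds at the first "dog" before any attempt ever begins at "park", so the (*SKIP)(*FAIL) branch is never reached for that string.

```r
# Worked example from the answer: the match starts right after "park"
s <- "That park has too many dogs"
m <- regmatches(s, regexpr(".*park(*SKIP)(*FAIL)|.*dog", s, perl = TRUE))
print(m)
# [1] " has too many dog"

# Pattern from the question: "park" comes after "dog ... man",
# so the second alternative matches first and the string is accepted
q <- "The dog and the man play in the park."
pat <- "park(*SKIP)(*FAIL)|(dog.*man|man.*dog)"
hit <- regmatches(q, regexpr(pat, q, ignore.case = TRUE, perl = TRUE))
print(hit)
# [1] "dog and the man"
```

In other words, (*SKIP)(*FAIL) only discards the text that the "park" branch itself consumed; it does not veto the whole string.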


How can I match the logic of find "dog" & "man" but not "park" with 1 regex?

Use stribizhev's solution! Try to avoid control verbs for the sake of compatibility: they're not implemented in all regex flavours. But if you're interested in these regex oddities, there's another, stronger control verb: (*COMMIT). It is similar to (*SKIP) in that it acts only on backtracking, except that it causes the entire match to fail (there won't be any further attempt at all). For example:

+-----------------------------------------------+
|Pattern:                                       |
|^.*park(*COMMIT)(*FAIL)|dog                    |
+-------------------------------------+---------+
|Subject                              | Matches |
+-----------------------------------------------+
|The dog and the man play in the park.|  FALSE  |
|Man I love that dog!                 |  TRUE   |
|I'm dog tired                        |  TRUE   |
|The dog park is no place for man.    |  FALSE  |
|park next to this dog's man.         |  FALSE  |
+-------------------------------------+---------+
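
The table can be checked in R roughly as follows (a sketch using the subjects exactly as listed in the table; the names subjects and res_commit are illustrative). No ignore.case is used here, since the table's subjects spell "park" and "dog" in lower case.

```r
subjects <- c(
    "The dog and the man play in the park.",
    "Man I love that dog!",
    "I'm dog tired",
    "The dog park is no place for man.",
    "park next to this dog's man."
)

# ^.*park reaches (*COMMIT) whenever "park" occurs anywhere in the string;
# (*FAIL) then backtracks into (*COMMIT), which aborts the whole match
res_commit <- grepl("^.*park(*COMMIT)(*FAIL)|dog", subjects, perl = TRUE)
print(res_commit)
# [1] FALSE  TRUE  TRUE FALSE FALSE
```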

Mariano answered Oct 13 '22 21:10