Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex that matches xa?b?c? but not x alone

Tags:

regex

I'm trying to write a regex that matches xa?b?c? but not x. In reality, 'x', 'a', 'b', and 'c' are not single characters, they are moderately complex sub-expressions, so I'm trying to avoid something like x(abc|ab|ac|bc|a|b|c). Is there a simple way to match "at least one of a, b, and c, in that order" in a regex, or am I out of luck?

like image 357
So8res Avatar asked Nov 12 '10 20:11

So8res


4 Answers

Here’s the shortest version:

(a)?(b)?(c)?(?(1)|(?(2)|(?(3)|(*FAIL))))

If you need to keep around the match in a separate group, write this:

((a)?(b)?(c)?)(?(2)|(?(3)|(?(4)|(*FAIL))))

But that isn’t very robust in case a, b, or c contain capture groups. So instead write this:

(?<A>a)?(?<B>b)?(?<C>c)?(?(<A>)|(?(<B>)|(?(<C>)|(*FAIL))))

And if you need a group for the whole match, then write this:

(?<M>(?<A>a)?(?<B>b)?(?<C>c)?(?(<A>)|(?(<B>)|(?(<C>)|(*FAIL)))))

And if like me you prefer multi-lettered identifiers and also think this sort of thing is insane without being in /x mode, write this:

(?x)
(?<Whole_Match>
    (?<Group_A> a) ?
    (?<Group_B> b) ?  
    (?<Group_C> c) ?

    (?(<Group_A>)           # Succeed 
      | (?(<Group_B>)       # Succeed
          | (?(<Group_C>)   # Succeed
              |             (*FAIL)
            )
        )
    )
 )

And here is the full testing program to prove that those all work:

#!/usr/bin/perl
use 5.010_000;

my @pats = (
    qr/(a)?(b)?(c)?(?(1)|(?(2)|(?(3)|(*FAIL))))/,
    qr/((a)?(b)?(c)?)(?(2)|(?(3)|(?(4)|(*FAIL))))/,
    qr/(?<A>a)?(?<B>b)?(?<C>c)?(?(<A>)|(?(<B>)|(?(<C>)|(*FAIL))))/,
    qr/(?<M>(?<A>a)?(?<B>b)?(?<C>c)?(?(<A>)|(?(<B>)|(?(<C>)|(*FAIL)))))/,
    qr{
        (?<Whole_Match>

            (?<Group_A> a) ?
            (?<Group_B> b) ?
            (?<Group_C> c) ?

            (?(<Group_A>)               # Succeed
              | (?(<Group_B>)           # Succeed
                  | (?(<Group_C>)       # Succeed
                      |                 (*FAIL)
                    )
                )
            )

        )
    }x,
);

for my $pat (@pats) {
    say "\nTESTING $pat";
    $_ = "i can match bad crabcatchers from 34 bc and call a cab";
    while (/$pat/g) {
        say "$`<$&>$'";
    }
}

All five versions produce this output:

i <c>an match bad crabcatchers from 34 bc and call a cab
i c<a>n match bad crabcatchers from 34 bc and call a cab
i can m<a>tch bad crabcatchers from 34 bc and call a cab
i can mat<c>h bad crabcatchers from 34 bc and call a cab
i can match <b>ad crabcatchers from 34 bc and call a cab
i can match b<a>d crabcatchers from 34 bc and call a cab
i can match bad <c>rabcatchers from 34 bc and call a cab
i can match bad cr<abc>atchers from 34 bc and call a cab
i can match bad crabc<a>tchers from 34 bc and call a cab
i can match bad crabcat<c>hers from 34 bc and call a cab
i can match bad crabcatchers from 34 <bc> and call a cab
i can match bad crabcatchers from 34 bc <a>nd call a cab
i can match bad crabcatchers from 34 bc and <c>all a cab
i can match bad crabcatchers from 34 bc and c<a>ll a cab
i can match bad crabcatchers from 34 bc and call <a> cab
i can match bad crabcatchers from 34 bc and call a <c>ab
i can match bad crabcatchers from 34 bc and call a c<ab>

Sweet, eh?

EDIT: For the x in the beginning part, just put whatever x you want at the start of the match, before the very first optional capture group for the a part, so like this:

x(a)?(b)?(c)?(?(1)|(?(2)|(?(3)|(*FAIL))))

or like this

(?x)                        # enable non-insane mode

(?<Whole_Match>
    x                       # first match some leader string

    # now match a, b, and c, in that order, and each optional
    (?<Group_A> a ) ?
    (?<Group_B> b ) ?  
    (?<Group_C> c ) ?

    # now make sure we got at least one of a, b, or c
    (?(<Group_A>)           # SUCCEED!
      | (?(<Group_B>)       # SUCCEED!
          | (?(<Group_C>)   # SUCCEED!
              |             (*FAIL)
            )
        )
    )
)

The test sentence was constructed without the x part, so it won’t work for that, but I think I’ve shown how I mean to go at this. Note that all of x, a, b, and c can be arbitrarily complex patterns (yes, even recursive), not merely single letters, and it doesn’t matter if they use numbered capture groups of their own, even.

If you want to go at this with lookaheads, you can do this:

(?x)

(?(DEFINE)
       (?<Group_A> a)
       (?<Group_B> b)
       (?<Group_C> c)
)

x

(?= (?&Group_A)
  | (?&Group_B)
  | (?&Group_C)
)

(?&Group_A) ?
(?&Group_B) ?
(?&Group_C) ?

And here is what to add to the @pats array in the test program to show that this approach also works:

qr{
    (?(DEFINE)
        (?<Group_A> a)
        (?<Group_B> b)
        (?<Group_C> c)
    )

    (?= (?&Group_A)
      | (?&Group_B)
      | (?&Group_C)
    )

    (?&Group_A) ?
    (?&Group_B) ?
    (?&Group_C) ?
}x

You’ll notice please that I still manage never to repeat any of a, b, or c, even with the lookahead technique.

Do I win? ☺

like image 188
tchrist Avatar answered Sep 20 '22 09:09

tchrist


Not trivially, if you don't have lookahead.

x(ab?c?|bc?|c)
like image 37
Ignacio Vazquez-Abrams Avatar answered Sep 21 '22 09:09

Ignacio Vazquez-Abrams


How about this:

x(?:a())?(?:b())?(?:c())?(\1|\2|\3)

The empty capturing groups after a, b and c will always match (an empty string) if a, b or c match, in that order.

The (\1|\2|\3) part will only match if at least one of the preceding groups participated in the match. So if you just have x, the regex fails.

Every part of the regex will be evaluated just once.

Of course, if x, a, b and c are more complex subexpressions that contain capturing groups themselves, you have to adjust the numbers of the backreferences accordingly*.

Since this regex does look a bit strange, here's the verbose version:

x          # Match x
(?:a())?   # Try to match a. If this succeeds, \1 will contain an empty string.
(?:b())?   # Same with b and \2.
(?:c())?   # Same with c and \3.
(\1|\2|\3) # Now try to match the content of one of the backreferences. 
           # This works if one of the empty parentheses participated in the match.
           # If so, the backref contains an empty string which always matches. 
           # Bingo!

You might need to surround this with anchors (^ and $) unless you don't mind it matching xb within the string cxba etc.

For example, in Python:

>>> r=re.compile(r"x(?:a())?(?:b())?(?:c())?(\1|\2|\3)$")
>>> for test in ("x", "xa", "xabc", "xba"):
...     m = r.match(test)
...     if m:
...         print("{} --> {}".format(test, m.group(0)))
...     else:
...         print("{} --> no match".format(test))
...
x --> no match
xa --> xa
xabc --> xabc
xba --> no match

*or, if your regex flavor knows named capturing groups, you can use those, for example

x(?:a(?P<a>))?(?:b(?P<b>))?(?:c(?P<c>))?((?P=a)|(?P=b)|(?P=c))

in Python/PCRE. In .NET (and possibly other flavors), it's even legal to have several capturing groups that use the same name, making another simplification possible:

x(?:a(?<m>))?(?:b(?<m>))?(?:c(?<m>))?\k<m>
like image 35
Tim Pietzcker Avatar answered Sep 22 '22 09:09

Tim Pietzcker


How about something like

x(?=[abc])a?b?c?
like image 32
Blindy Avatar answered Sep 21 '22 09:09

Blindy