
Regex: Matching against groups in different order without repeating the group

Tags: regex

Let's say I have two strings like this:

XABY
XBAY

A simple regex that matches both would go like this:

X(AB|BA)Y

However, I have a case where A and B are complicated strings, and I'm looking for a way to avoid having to specify each of them twice (on each side of the |). Is there a way to do this (that presumably is simpler than having to specify them twice)?

Thanks

Jimmy asked Apr 08 '10 00:04




1 Answer

X(?:A()|B()){2}\1\2Y

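Basically, you use an empty capturing group to check off each item when it's matched, then the back-references ensure that everything's been checked off: a back-reference to a group that never participated in the match fails, so \1\2 can succeed only if both A and B were matched somewhere in the two iterations.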

Be aware that this relies on undocumented regex behavior, so there's no guarantee that it will work in your regex flavor, and if it does, there's no guarantee that it will continue to work as that flavor evolves. But as far as I know, it works in every flavor that supports back-references. (EDIT: It does not work in JavaScript, where a back-reference to a group that has not participated in the match succeeds by matching the empty string, so the check never fails.)
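As a quick sanity check, here is a minimal sketch of the trick in Python's re module. Python is my choice for illustration, not something the question specified, and the literal A and B are stand-ins for the real, more complicated subpatterns. The helper name matches is made up for this example.

```python
import re

# "A" and "B" are placeholders for the real subpatterns (an assumption).
# Each empty group () is a check box: group 1 is set once A has matched,
# group 2 once B has matched. \1\2 then requires both boxes to be ticked.
PATTERN = re.compile(r"X(?:A()|B()){2}\1\2Y")


def matches(text: str) -> bool:
    """Return True if the whole string matches the check-box pattern."""
    return PATTERN.fullmatch(text) is not None


for candidate in ["XABY", "XBAY", "XAAY", "XBBY"]:
    print(candidate, matches(candidate))
```

The point of including XAAY and XBBY is that a plain X(?:A|B){2}Y would accept them too; the back-references are what should rule them out.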

EDIT: You say you're using named groups to capture parts of the match, which adds a lot of visual clutter to the regex, if not real complexity. Well, if you happen to be using .NET regexes, you can still use simple numbered groups for the "check boxes". Here's a simplistic example that finds and picks apart a bunch of month-day strings without knowing their internal order:

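  using System;
  using System.Text.RegularExpressions;

  Regex r = new Regex(
    @"(?:
        (?<MONTH>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)()  # check box \1
        |
        (?<DAY>\d+)()                                                # check box \2
      ){2}
      \1\2   # succeeds only if both check boxes were ticked",
    RegexOptions.IgnorePatternWhitespace);

  string input = @"30Jan Feb12 Mar23 4Apr May09 11Jun";
  foreach (Match m in r.Matches(input))
  {
    // Print month first, then day, whatever order they appeared in.
    Console.WriteLine("{0} {1}", m.Groups["MONTH"], m.Groups["DAY"]);
  }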

This works because in .NET, the presence of named groups has no effect on the ordering of the non-named groups. Named groups have numbers assigned to them, but those numbers start after the last of the non-named groups. (I know that seems gratuitously complicated, but there are good reasons for doing it that way.)
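For contrast, here is a rough sketch of the same month-day idea in Python, which I'm using purely as an illustration of a flavor that numbers groups differently. In Python's re module, named groups are numbered in sequence along with the unnamed ones, so the two check boxes end up as groups 2 and 4 rather than 1 and 2. The names MONTH_DAY and month_day_pairs are made up for this example.

```python
import re

# In Python, named and unnamed groups share one left-to-right numbering:
#   1 = MONTH, 2 = first check box, 3 = DAY, 4 = second check box.
MONTH_DAY = re.compile(
    r"""(?:
          (?P<MONTH>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)()
          |
          (?P<DAY>\d+)()
        ){2}
        \2\4   # both check boxes must be ticked
    """,
    re.VERBOSE,
)


def month_day_pairs(text: str) -> list[tuple[str, str]]:
    """Return (month, day) pairs regardless of the order they appear in."""
    return [(m.group("MONTH"), m.group("DAY")) for m in MONTH_DAY.finditer(text)]


print(month_day_pairs("30Jan Feb12 Mar23 4Apr May09 11Jun"))
```

Having to count past the named groups to find the check boxes is exactly the clutter that the .NET numbering scheme avoids.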

Normally you want to avoid using named and non-named capturing groups together, especially if you're using back-references, but I think this case could be a legitimate exception.

Alan Moore answered Oct 04 '22 14:10