Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a way to define custom shorthands in regular expressions?

Tags:

regex

I have a regular expression of the form

def parse(self, format_string):
    for m in re.finditer(
        r"""(?: \$ \( ( [^)]+ ) \) )   # the field access specifier
          | (
                (?:
                    \n | . (?= \$ \( ) # any one single character before the '$('
                )
              | (?:
                    \n | . (?! \$ \( ) # any one single character, except the one before the '$('
                )*
            )""",
        format_string,
        re.VERBOSE):
    ...

I would like to replace all the repeating sequences (\$ \() with some custom shorthand "constant", which would look like this:

def parse(self, format_string):
    re.<something>('\BEGIN = \$\(')
    for m in re.finditer(
        r"""(?: \BEGIN ( [^)]+ ) \) )   # the field access specifier
          | (
                (?:
                    \n | . (?= \BEGIN ) # any one single character before the '$('
                )
              | (?:
                    \n | . (?! \BEGIN ) # any one single character, except the one before the '$('
                )*
            )""",
        format_string,
        re.VERBOSE):
    ...

Is there a way to do this with regular expressions themselves (i.e. not using Python's string formatting to substitute \BEGIN with \$\()?

Clarification: the Python source is purely for context and illustration. I'm looking for RE solution, which would be available in some RE dialect (maybe not Python's one), not the solution specifically for Python.

like image 321
Michael Pankov Avatar asked Aug 09 '13 15:08

Michael Pankov


1 Answers

I don't think this is possible in Python's regex flavor. You would need recursion (or rather pattern reuse) which is only supported by PCRE. In fact, PCRE even mentions how defining shorthands works in its man page (search for "Defining subpatterns").

In PCRE, you can use the recursion syntax in a similar way to backreferences - except that the pattern is applied again, instead of trying to get the same literal text as from a backreference. Example:

/(\d\d)-(?1)-(?1)/

Matches something like a date (where (?1) will be replaced with with \d\d and evaluated again). This is really powerful, because if you use this construct within the referenced group itself you get recursion - but we don't even need that here. The above also works with named groups:

/(?<my>\d\d)-(?&my)-(?&my)/

Now we're already really close, but the definition is also the first use of the pattern, which somewhat clutters up the expression. The trick is to use the pattern first in a position that is never evaluated. The man pages suggest a conditional that is dependent on a (non-existent) group DEFINE:

/
(?(DEFINE)
  (?<my>\d\d)
)
(?&my)-(?&my)-(?&my)
/x

The construct (?(group)true|false) applies pattern true if the group group was used before, and (the optional) pattern false otherwise. Since there is no group DEFINE, the condition will always be false, and the true pattern will be skipped. Hence, we can put all kinds of definitions there, without being afraid that they will ever be applied and mess up our results. This way we get them into the pattern, without really using them.

And alternative is a negative lookahead that never reaches the point where the expression is defined:

/
(?!
  (?!)     # fail - this makes the surrounding lookahead pass unconditionally
  # the engine never gets here; now we can write down our definitions
  (?<my>\d\d) 
)
(?&my)-(?&my)-(?&my)
/x

However, you only really need this form, if you don't have conditionals, but do have named pattern reuse (and I don't think a flavor like this exists). The other variant has the advantage, that the use of DEFINE makes it obvious what the group is for, while the lookahead approach is a bit obfuscated.

So back to your original example:

/
# Definitions
(?(DEFINE)
  (?<BEGIN>[$][(])
)
# And now your pattern
  (?: (?&BEGIN) ( [^)]+ ) \) ) # the field access specifier
|
  (
    (?: # any one single character before the '$('
      \n | . (?= (?&BEGIN) ) 
    )
  | 
    (?: # any one single character, except the one before the '$('
      \n | . (?! (?&BEGIN) ) 
    )*
  )
/x

There are two major caveats to this approach:

  1. Recursive references are atomic. That is, once the reference has matched something it will never be backtracked into. For certain cases this can mean that you have to be a bit clever in crafting your expression, so that the first match will always be the one you want.
  2. You cannot use capturing inside the defined patterns. If you use something like (?<myPattern>a(b)c) and reuse it, the b will never be captured - when reusing a pattern, all groups are non-capturing.

The most important advantage over any kind of interpolation or concatenation however is, that you can never produce invalid patterns with this, and you cannot mess up your capturing group counts either.

like image 78
Martin Ender Avatar answered Sep 21 '22 16:09

Martin Ender