Is it possible to clean a verbose Python regex before printing it?

The Setup:

Let's say I have the following regex defined in my script. I want to keep the comments there for future me because I'm quite forgetful.

RE_TEST = re.compile(r"""[0-9]            # 1 Number
                         [A-Z]            # 1 Uppercase Letter
                         [a-y]            # 1 lowercase, but not z
                         z                # gotta have z...
                         """,
                     re.VERBOSE)

print(magic_function(RE_TEST))   # hypothetical function; should print: [0-9][A-Z][a-y]z

The Question:

Does Python (3.4+) have a way to convert that to the simple string "[0-9][A-Z][a-y]z"?

Possible Solutions:

This question ("strip a verbose python regex") seems to be pretty close to what I'm asking for and it was answered. But that was a few years ago, so I'm wondering if a new (preferably built-in) solution has been found.

In addition to the above, there are workarounds, such as using implicit string concatenation and then reading the .pattern attribute:

RE_TEST = re.compile(r"[0-9]"      # 1 Number
                     r"[A-Z]"      # 1 Uppercase Letter
                     r"[a-y]"      # 1 lowercase, but not z
                     r"z",         # gotta have z...
                     re.VERBOSE)

print(RE_TEST.pattern)    # returns: "[0-9][A-Z][a-y]z"

or just commenting the pattern separately and not compiling it:

# matches pattern "nXxz"
RE_TEST = "[0-9][A-Z][a-y]z"
print(RE_TEST)

But I'd really like to keep the compiled regex the way it is (1st example). Perhaps I'm pulling the regex string from some file, and that file is already using the verbose form.
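
For context, here is a small check of why .pattern alone does not help with the first example: the attribute holds the source text exactly as it was passed to re.compile, so the comments and newlines are still there. This is only an illustrative sketch using the pattern from the setup above.

```python
import re

RE_TEST = re.compile(r"""[0-9]   # 1 Number
                         [A-Z]   # 1 Uppercase Letter
                         [a-y]   # 1 lowercase, but not z
                         z       # gotta have z...
                         """,
                     re.VERBOSE)

# .pattern is the source text as given, comments and newlines included.
print(repr(RE_TEST.pattern))

# The compiled object still matches as if the comments were absent.
print(bool(RE_TEST.fullmatch("1Abz")))
```

So the verbose form costs nothing at match time; the clutter only shows up when the pattern text is printed.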

Background

I'm asking because I want to suggest an edit to the unittest module.

Right now, if you run assertRegex(string, pattern) using a compiled pattern with comments and that assertion fails, then the printed output is somewhat ugly (the below is a dummy regex):

Traceback (most recent call last):
  File "verify_yaml.py", line 113, in test_verify_mask_names
    self.assertRegex(mask, RE_MASK)
AssertionError: Regex didn't match: '(X[1-9]X[0-9]{2})      # comment\n                         |(XXX[0-9]{2})         # comment\n                         |(XXXX[0-9E])          # comment\n                         |(XXXX[O1-9])          # comment\n                         |(XXX[0-9][0-9])       # comment\n                         |(XXXX[1-9])           # comment\n                         ' not found in 'string'

I'm going to propose that the assertRegex and assertNotRegex methods clean the regex before printing it, either by removing the comments and extra whitespace or by printing it differently.

dthor asked Feb 26 '16 00:02


2 Answers

The following tested script includes a function that does a pretty good job converting an xmode regex string to non-xmode:

pcre_detidy(retext)

# Function pcre_detidy to convert xmode regex string to non-xmode.
# Rev: 20160225_1800
import re
def detidy_cb(m):
    if m.group(2): return m.group(2)
    if m.group(3): return m.group(3)
    return ""

def pcre_detidy(retext):
    decomment = re.compile(r"""(?#!py/mx decomment Rev:20160225_1800)
        # Discard whitespace, comments and the escapes of escaped spaces and hashes.
          ( (?: \s+                  # Either g1of3 $1: Stuff to discard (3 types). Either ws,
            | \#.*                   # or comments,
            | \\(?=[\r\n]|$)         # or lone escape at EOL/EOS.
            )+                       # End one or more from 3 discardables.
          )                          # End $1: Stuff to discard.
        | ( [^\[(\s#\\]+             # Or g2of3 $2: Stuff to keep. Either non-[(\s# \\.
          | \\[^# Q\r\n]             # Or escaped-anything-but: hash, space, Q or EOL.
          | \(                       # Or an open parentheses, optionally
            (?:\?\#[^)]*(?:\)|$))?   # starting a (?# Comment group).
          | \[\^?\]? [^\[\]\\]*      # Or Character class. Allow unescaped ] if first char.
            (?:\\[^Q][^\[\]\\]*)*    # {normal*} Zero or more non-[], non-escaped-Q.
            (?:                      # Begin unrolling loop {((special1|2) normal*)*}.
              (?: \[(?::\^?\w+:\])?  # Either special1: "[", optional [:POSIX:] char class.
              | \\Q       [^\\]*     # Or special2: \Q..\E literal text. Begin with \Q.
                (?:\\(?!E)[^\\]*)*   # \Q..\E contents - everything up to \E.
                (?:\\E|$)            # \Q..\E literal text ends with \E or EOL.
              )        [^\[\]\\]*    # End special: One of 2 alternatives {(special1|2)}.
              (?:\\[^Q][^\[\]\\]*)*  # More {normal*} Zero or more non-[], non-escaped-Q.
            )* (?:\]|\\?$)           # End character class with ']' or EOL (or \\EOL).
          | \\Q       [^\\]*         # Or \Q..\E literal text start delimiter.
            (?:\\(?!E)[^\\]*)*       # \Q..\E contents - everything up to \E.
            (?:\\E|$)                # \Q..\E literal text ends with \E or EOL.
          )                          # End $2: Stuff to keep.
        | \\([# ])                   # Or g3of3 $3: Escaped-[hash|space], discard the escape.
        """, re.VERBOSE | re.MULTILINE)
    return re.sub(decomment, detidy_cb, retext)

test_text = r"""
        [0-9]            # 1 Number
        [A-Z]            # 1 Uppercase Letter
        [a-y]            # 1 lowercase, but not z
        z                # gotta have z...
        """
print(pcre_detidy(test_text))

This function detidies regexes written in pcre-8/pcre2-10 xmode syntax.

It preserves whitespace inside [character classes], (?#comment groups) and \Q...\E literal text spans.

RegexTidy

The decomment regex above is a variant of one I use in my upcoming, not yet released RegexTidy application. It will not only detidy a regex as shown above (which is pretty easy to do), but also go the other way and tidy a regex, i.e. convert it from non-xmode to xmode syntax, adding whitespace indentation to nested groups as well as comments (which is harder).

p.s. Before giving this answer a downvote on general principle because it uses a regex longer than a couple lines, please add a comment describing one example which is not handled correctly. Cheers!

ridgerunner answered Nov 02 '22 04:11


Looking at how sre_parse handles this, there really isn't any point where your verbose regex gets "converted" into a regular one and then parsed. Rather, your verbose regex is fed directly to the parser, where the VERBOSE flag makes it ignore unescaped whitespace outside character classes, as well as everything from an unescaped # to the end of the line, provided the # is not inside a character class or a capture group (the latter is missing from the docs).

The outcome of parsing your verbose regex there is not "[0-9][A-Z][a-y]z". Rather it is:

[(IN, [(RANGE, (48, 57))]), (IN, [(RANGE, (65, 90))]), (IN, [(RANGE, (97, 121))]), (LITERAL, 122)]
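
A parse result like the one above can be reproduced with a few lines. This is a sketch that leans on an internal module: sre_parse is not a public API, and on Python 3.11+ it lives at re._parser (the old name is deprecated), so the import fallback below is an assumption about which interpreter is in use.

```python
import re

try:
    # Python 3.11+: the parser is a private submodule of re.
    from re import _parser as sre_parse
except ImportError:
    # Older versions, including the 3.4 era of the question.
    import sre_parse

VERBOSE_SOURCE = r"""[0-9]   # 1 Number
                     [A-Z]   # 1 Uppercase Letter
                     [a-y]   # 1 lowercase, but not z
                     z       # gotta have z...
                     """

# Passing re.VERBOSE makes the parser skip whitespace and comments as it goes.
parsed = sre_parse.parse(VERBOSE_SOURCE, re.VERBOSE)

# Each item is an (opcode, argument) pair; no cleaned-up string is ever built.
for opcode, argument in parsed:
    print(opcode, argument)
```

Because this is internal machinery, the exact opcodes and their representation may differ between Python versions.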

In order to do a proper job of converting your verbose regex to "[0-9][A-Z][a-y]z", you could parse it yourself, for example with a library like pyparsing. The other answer linked in your question uses a regex, which will generally not duplicate the behavior correctly (specifically, spaces inside character classes and # inside capture groups/character classes). And even just dealing with escaping is not as convenient as with a good parser.
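
As a rough illustration of the "parse it yourself" idea, here is a minimal hand-written scanner instead of pyparsing. It is a sketch under stated assumptions, not a faithful copy of what the re parser does: strip_verbose is a hypothetical name, and the function only tracks backslash escapes and character classes. It does not handle inline flags such as (?x), (?#...) comment groups, or whitespace inside multi-character tokens.

```python
import re

# Whitespace characters skipped in this sketch (ASCII whitespace only).
VERBOSE_WHITESPACE = " \t\n\r\v\f"


def strip_verbose(pattern):
    """Return pattern without verbose-mode whitespace and comments (sketch)."""
    out = []
    i = 0
    n = len(pattern)
    in_class = False
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            # Keep every escape pair intact, e.g. "\#", "\ " or "\]".
            out.append(pattern[i:i + 2])
            i += 2
        elif in_class:
            # Inside [...], whitespace and '#' are ordinary characters.
            out.append(ch)
            in_class = ch != "]"
            i += 1
        elif ch == "[":
            in_class = True
            out.append(ch)
            i += 1
            # A ']' directly after '[' or '[^' is a literal, not the end.
            if i < n and pattern[i] == "^":
                out.append("^")
                i += 1
            if i < n and pattern[i] == "]":
                out.append("]")
                i += 1
        elif ch in VERBOSE_WHITESPACE:
            i += 1
        elif ch == "#":
            # Comment: skip up to (not including) the newline.
            while i < n and pattern[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


verbose = r"""[0-9]   # 1 Number
              [A-Z]   # 1 Uppercase Letter
              [a-y]   # 1 lowercase, but not z
              z       # gotta have z...
              """
print(strip_verbose(verbose))  # prints: [0-9][A-Z][a-y]z
```

A reasonable sanity check for any such function is to compile both the original (with re.VERBOSE) and the stripped text (without it) and confirm that they accept and reject the same sample strings.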

Jason S answered Nov 02 '22 05:11