
strip a verbose python regex

Tags:

python

regex

I have a verbose python regex string (with lots of whitespace and comments) that I'd like to convert to "normal" style (for export to javascript). In particular, I need this to be quite reliable. If there's any demonstrably correct way to do this, it's what I want. For example, a naive implementation would destroy a regex like r' \# # A literal hash character', which is not OK.

The best way to do this would be to coerce the python re module to give me back a non-verbose representation of my regex, but I don't see a way to do that.

bukzor asked Feb 14 '13 22:02


1 Answer

I believe you only need to address these two issues to strip a verbose regex:

  1. delete comments to the end of line
  2. delete unescaped whitespace

Try this, which chains the two steps as separate regex substitutions:

import re

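def unverbosify_regex_simple(verbose):
    # Unescaped whitespace: preceded by an even number of backslashes (kept in group 1)
    WS_RX = r'(?<!\\)((\\{2})*)\s+'
    # Unescaped '#' through end of line; (?m) must lead the pattern on Python 3.11+
    CM_RX = r'(?m)(?<!\\)((\\{2})*)#.*$'

    # Strip comments first, then whitespace, keeping any even run of backslashes
    return re.sub(WS_RX, "\\1", re.sub(CM_RX, "\\1", verbose))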
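
As a quick check, here is a minimal sketch that runs the simplified function on the example from the question. It restates the function so the snippet is self-contained, and passes `re.MULTILINE` as a flag argument instead of an inline `(?m)`; the variable names `verbose` and `stripped` are only illustrative.

```python
import re

def unverbosify_regex_simple(verbose):
    # Same two substitutions as above; MULTILINE is passed as a flag argument.
    ws_rx = r'(?<!\\)((\\{2})*)\s+'
    cm_rx = r'(?<!\\)((\\{2})*)#.*$'
    no_comments = re.sub(cm_rx, r'\1', verbose, flags=re.MULTILINE)
    return re.sub(ws_rx, r'\1', no_comments)

# The example from the question: an escaped hash followed by a comment.
verbose = r' \# # A literal hash character'
stripped = unverbosify_regex_simple(verbose)
print(stripped)  # prints \#

# Both forms should accept the same input.
assert re.fullmatch(verbose, '#', re.VERBOSE)
assert re.fullmatch(stripped, '#')
```

The escaped hash survives because the lookbehind rejects any `#` or whitespace character that directly follows a backslash.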

The above is a simplified version that leaves escaped spaces as-is. The resulting output will be a little harder to read, but it should still work on other regex engines.

Alternatively, for a slightly more complex answer that "unescapes" spaces (i.e., '\ ' => ' ') and returns what I think most people would expect:

import re

def unverbosify_regex(verbose):
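    # (?m) must lead the pattern on Python 3.11+, where a trailing global flag is an error
    CM1_RX = r'(?m)(?<!\\)((\\{2})*)#.*$'   # unescaped '#' through end of line
    CM2_RX = r'(\\)?((\\{2})*)(#)'          # remaining (escaped) '#'
    WS_RX  = r'(\\)?((\\{2})*)(\s)\s*'      # a run of whitespace, possibly escaped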

    def strip_escapes(match):
        ## if even slashes: delete space and retain slashes
        if match.group(1) is None:
            return match.group(2)

        ## if number of slashes is odd: delete slash and keep space (or 'comment')
        elif match.group(1) == '\\':
            return match.group(2) + match.group(4)

        ## error
        else:
            raise Exception

    not_verbose_regex = re.sub(WS_RX, strip_escapes,
                          re.sub(CM2_RX, strip_escapes,
                            re.sub(CM1_RX, "\\1", verbose)))

    return not_verbose_regex
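
Since the question asks for reliability, one way to gain some confidence (a sketch, not a proof of correctness) is to compile both the verbose and the stripped pattern and compare them on sample inputs. The helper `check_equivalent`, the sample pattern, and the sample strings below are all hypothetical, and the function is restated in condensed form so the snippet runs on its own.

```python
import re

def unverbosify_regex(verbose):
    # Condensed restatement of the function above, for a self-contained example.
    cm1_rx = r'(?<!\\)((\\{2})*)#.*$'
    cm2_rx = r'(\\)?((\\{2})*)(#)'
    ws_rx = r'(\\)?((\\{2})*)(\s)\s*'

    def strip_escapes(match):
        # Even number of backslashes: drop the space or hash, keep the backslashes.
        if match.group(1) is None:
            return match.group(2)
        # Odd number: drop one backslash, keep the escaped character.
        return match.group(2) + match.group(4)

    no_comments = re.sub(cm1_rx, r'\1', verbose, flags=re.MULTILINE)
    return re.sub(ws_rx, strip_escapes, re.sub(cm2_rx, strip_escapes, no_comments))

def check_equivalent(verbose, samples):
    """Hypothetical helper: True if both forms agree on every sample."""
    verbose_rx = re.compile(verbose, re.VERBOSE)
    plain_rx = re.compile(unverbosify_regex(verbose))
    return all(
        bool(verbose_rx.fullmatch(s)) == bool(plain_rx.fullmatch(s))
        for s in samples
    )

# Hypothetical verbose pattern mixing comments, an escaped space and an escaped hash.
pattern = r'''
    (\d{3})  # area code
    \ ?      # optional escaped space
    \#       # a literal hash
    (\d{4})  # line number
'''
samples = ['555 #1234', '555#1234', '555  #1234', '5551234', '']
print(unverbosify_regex(pattern))         # prints (\d{3}) ?#(\d{4})
print(check_equivalent(pattern, samples))  # prints True
```

A check like this only covers the samples it is given. Python's verbose mode also keeps whitespace and `#` inside character classes such as `[ #]`, which the substitutions above do not account for, so patterns of that kind are worth testing separately.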

UPDATE: Added comments to explain even vs. odd slash counting. Fixed first group in CM_RX to retain full 'comment' if slash count is odd.

UPDATE 2: Fixed comments regex, which was not dealing with escaped hashes properly. Should handle both "\# #escaped hash" as well as "# comment with \# escaped hash" and "\\# comment"

UPDATE 3: Added a simplified version that doesn't clean up escaped spaces.

UPDATE 4: Further simplification to eliminate variable-length negative lookbehind (and reverse/reverse trick)

dpkp answered Sep 28 '22 13:09