I have a verbose Python regex string (with lots of whitespace and comments) that I'd like to convert to "normal" style (for export to JavaScript). In particular, I need this to be quite reliable. If there's any demonstrably correct way to do this, that's what I want. For example, a naive implementation would destroy a regex like r' \# # A literal hash character', which is not OK.
The best way to do this would be to coerce the python re module to give me back a non-verbose representation of my regex, but I don't see a way to do that.
I believe you only need to address these two issues to strip a verbose regex:
1. Delete comments, from an unescaped # to the end of the line.
2. Delete unescaped whitespace.
Try this, which chains the two as separate regex substitutions:
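import re

def unverbosify_regex_simple(verbose):
    # Whitespace preceded by an even number (including zero) of backslashes is unescaped.
    WS_RX = r'(?<!\\)((\\{2})*)\s+'
    # Same parity rule for '#'. The (?m) flag must lead the pattern (Python 3.11+ rejects it elsewhere).
    CM_RX = r'(?m)(?<!\\)((\\{2})*)#.*$'
    # Strip comments first, then whitespace, keeping any escaped-backslash pairs (group 1).
    return re.sub(WS_RX, "\\1", re.sub(CM_RX, "\\1", verbose))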
The above is a simplified version that leaves escaped spaces as-is. The resulting output will be a little harder to read but should still work on other regex platforms.
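To see the simplified version on concrete input, here is a small sketch that repeats the function and runs it on the literal-hash line from the question plus a two-line verbose pattern. One assumption is baked in: the `(?m)` flag is written at the front of the comment pattern, since Python 3.11 and later only accept inline global flags there. The sample patterns and the test string '12 ab' are made up for illustration.

```python
import re

def unverbosify_regex_simple(verbose):
    WS_RX = r'(?<!\\)((\\{2})*)\s+'
    # Assumption: (?m) is placed first, as Python 3.11+ requires for inline global flags.
    CM_RX = r'(?m)(?<!\\)((\\{2})*)#.*$'
    return re.sub(WS_RX, r'\1', re.sub(CM_RX, r'\1', verbose))

# The example from the question: the escaped hash survives, the comment does not.
print(unverbosify_regex_simple(r' \# # A literal hash character'))   # \#

verbose = r'''
    \d+      # one or more digits
    \ \w+    # an escaped space, then word characters
'''
compact = unverbosify_regex_simple(verbose)
print(compact)                                                       # \d+\ \w+

# Both forms should accept the same input.
assert re.fullmatch(verbose, '12 ab', flags=re.VERBOSE)
assert re.fullmatch(compact, '12 ab')
```

Note that the escaped space stays as `\ ` in the output, which is the "harder to read" part mentioned above.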
Alternatively, for a slightly more complex answer that "unescapes" spaces (i.e., '\ ' => ' ') and returns what I think most people would expect:
import re

def unverbosify_regex(verbose):
    # (?m) must lead the pattern (Python 3.11+ rejects inline global flags elsewhere).
    CM1_RX = r'(?m)(?<!\\)((\\{2})*)#.*$'
    CM2_RX = r'(\\)?((\\{2})*)(#)'
    WS_RX = r'(\\)?((\\{2})*)(\s)\s*'

    def strip_escapes(match):
        ## if even slashes: delete space and retain slashes
        if match.group(1) is None:
            return match.group(2)
        ## if number of slashes is odd: delete slash and keep space (or 'comment')
        elif match.group(1) == '\\':
            return match.group(2) + match.group(4)
        ## error
        else:
            raise ValueError('unexpected match: %r' % match.group(0))

    # Order matters: drop comments, unescape remaining hashes, then handle whitespace.
    not_verbose_regex = re.sub(WS_RX, strip_escapes,
                               re.sub(CM2_RX, strip_escapes,
                                      re.sub(CM1_RX, "\\1", verbose)))
    return not_verbose_regex
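The even/odd slash rule in the comments above can also be written out by hand, which may make the regexes easier to follow: a character counts as escaped only when an odd number of backslashes sits directly in front of it. This is just an illustrative sketch; `is_escaped` is a hypothetical helper name and is not used by the functions above.

```python
def is_escaped(text, index):
    """Return True if text[index] is preceded by an odd number of backslashes."""
    count = 0
    pos = index - 1
    # Walk left over the run of backslashes that ends just before text[index].
    while pos >= 0 and text[pos] == '\\':
        count += 1
        pos -= 1
    return count % 2 == 1

sample = r'\# \\# x'
hash_positions = [i for i, ch in enumerate(sample) if ch == '#']
print([(i, is_escaped(sample, i)) for i in hash_positions])
# [(1, True), (5, False)]
```

The first hash has one backslash in front of it, so it is a literal hash. The second has two, which is an escaped backslash followed by a comment start.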
UPDATE: Added comments to explain even vs. odd slash counting. Fixed the first group in CM_RX to retain the full 'comment' if the slash count is odd.
UPDATE 2: Fixed comments regex, which was not dealing with escaped hashes properly. Should handle both "\# #escaped hash" as well as "# comment with \# escaped hash" and "\\# comment"
UPDATE 3: Added a simplified version that doesn't clean up escaped spaces.
UPDATE 4: Further simplification to eliminate variable-length negative lookbehind (and reverse/reverse trick)
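Since reliability is the main concern, one way to gain confidence in any converter is to let Python itself be the judge: compile the original with `re.VERBOSE`, compile the converted string without it, and compare results on sample inputs. Below is a minimal sketch of that idea. The name `same_matches`, the sample strings, and the hand-written `compact` pattern (standing in for a converter's output) are all assumptions made for illustration.

```python
import re

def same_matches(verbose_pattern, compact_pattern, samples):
    """Return the sample strings on which the two patterns disagree."""
    verbose_rx = re.compile(verbose_pattern, re.VERBOSE)
    compact_rx = re.compile(compact_pattern)
    mismatches = []
    for text in samples:
        v = verbose_rx.fullmatch(text)
        c = compact_rx.fullmatch(text)
        # Disagreement: only one side matched, or the captured groups differ.
        if (v is None) != (c is None) or (v and v.groups() != c.groups()):
            mismatches.append(text)
    return mismatches

verbose = r'''
    (\d+)   # one or more digits
    \       # an escaped (literal) space
    \#      # a literal hash
'''
compact = r'(\d+)\ \#'   # hand-written stand-in for a converter's output
samples = ['12 #', '7 #', '12#', 'x #', '']
print(same_matches(verbose, compact, samples))   # []
```

An empty list only means the two patterns agree on these samples, so it is worth choosing samples that stress the tricky cases. Patterns with whitespace or `#` inside a character class are good candidates, because Python's verbose mode treats those characters literally there and a purely textual substitution has no way of knowing that.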