I would like to parse codetags in source files. I wrote this regex that works fine with PCRE:
(?<tag>(?&TAG)):\s*
(?<message>.*?)
(
<
(?<author>(?:\w{3}\s*,\s*)*\w{3})?\s*
(?<date>(?&DATE))?
(?<flags>(?&FLAGS))?
>
)?
$
(?(DEFINE)
(?<TAG>\b(NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG))
(?<DATE>\d{4}-\d{2}-\d{2})
(?<FLAGS>[pts]:\w+\b)
)
Unfortunately, it seems Python's re module doesn't understand the (?(DEFINE)...) construct (https://regex101.com/r/qH1uG3/1#pcre).
What is the best workaround in Python?
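Using the regex module:
As explained in the comments, the third-party regex module lets you reuse named subpatterns. Unfortunately, at the time of writing it had no (?(DEFINE)...)
syntax like the one in Perl or PCRE (newer releases of regex appear to document one, so check your installed version).
The workaround is the same one used in Ruby: put a {0} quantifier on a named group
when you only want to define a subpattern. The group itself then matches nothing, but it can still be called with (?&NAME):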
import regex
s = r'''
// NOTE: A small example
// HACK: Another example <ABC 2014-02-03>
// HACK: Another example <ABC,DEF 2014-02-03>
// HACK: Another example <ABC,DEF p:0>
'''
p = r'''
# subpattern definitions
(?<TAG> \b(?:NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG) ){0}
(?<DATE> \d{4}-\d{2}-\d{2} ){0}
(?<FLAGS> [pts]:\w+ ){0}
# main pattern
(?<tag> (?&TAG) ) : \s*
(?<message> (?>[^\s<]+[^\n\S]+)* [^\s<]+ )? \s* # to trim the message
<
(?<author> (?: \w{3} \s* , \s* )*+ \w{3} )? \s*
(?<date> (?&DATE) )?
(?<flags> (?&FLAGS) )?
>
$
'''
rgx = regex.compile(p, regex.VERBOSE | regex.MULTILINE)
# print the tag of every line the pattern matches
for m in rgx.finditer(s):
    print(m.group('tag'))
Note: the subpatterns can be defined at the end of the pattern too.
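If installing the third-party regex module is not an option, a similar effect can probably be had with the standard re module by keeping the shared fragments in ordinary Python strings and interpolating them into the main pattern. This is only a sketch under assumptions, not part of the answer above: the names TAG, DATE, FLAGS, CODETAG and parse_codetags are made up for illustration, and the main pattern follows the one in the question (with an optional <...> part) rather than the trimmed version above.

```python
import re

# Shared fragments kept as plain strings; they stand in for the
# (?(DEFINE)...) block, since the stdlib re has no subpattern calls.
TAG = r"\b(?:NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)"
DATE = r"\d{4}-\d{2}-\d{2}"
FLAGS = r"[pts]:\w+\b"

# The fragments are pasted in with an f-string, so literal regex braces
# in the main pattern must be doubled ({{3}} means the quantifier {3}).
CODETAG = re.compile(
    rf"""
    (?P<tag>{TAG}) : \s*
    (?P<message>.*?) \s*
    (?:
        <
        (?P<author>(?:\w{{3}}\s*,\s*)*\w{{3}})? \s*
        (?P<date>{DATE})? \s*
        (?P<flags>{FLAGS})?
        >
    )?
    $
    """,
    re.VERBOSE | re.MULTILINE,
)


def parse_codetags(text):
    """Return one dict of named groups per codetag found in text."""
    return [m.groupdict() for m in CODETAG.finditer(text)]


if __name__ == "__main__":
    sample = (
        "// NOTE: A small example\n"
        "// HACK: Another example <ABC,DEF 2014-02-03>\n"
        "// TODO: fix me <ABC p:0>\n"
    )
    for fields in parse_codetags(sample):
        print(fields)
```

The trade-off is that the fragments are copied into the pattern textually instead of being called, so this does not help with recursive subpatterns, but for simple reuse like TAG, DATE and FLAGS it avoids the extra dependency.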