I have a regex:
(.*\n)+DOCUMENTATION.*(\"\"\"|''')\n-*\n?((.*\n)+?)(\2)(?s:.*)
with which I'm trying to process files like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# <GNU license here>
DOCUMENTATION = """
module: foo
short_description: baz
<some more here>
"""
<rest of the python code>
I need to get the DOCUMENTATION part from it.
It works quite well, but not with Python. The problem is with the inline modifier (?s:.*), which I used to catch the rest of the file (any character, including newlines, zero or more times). It looks like this is handled differently in Python.
Here at regex101 is the example. It shows an error when I switch it to Python.
NOTE: I can't set modifiers globally. (I can only pass a regex rule to a Python module.)
Python implements inline (embedded) modifiers, such as (?s), (?i) or (?aiLmsux), but not as part of a non-capturing group modifier like you were trying to use. (?smi:subpattern) works in Perl and PCRE, but not in Python's re before version 3.6.
Moreover, using an inline modifier anywhere in the pattern applies to the whole match and it can't be turned off.
From regular-expressions.info:
In Python, putting a modifier in the middle of the regex affects the whole regex. So in Python, (?i)caseless and caseless(?i) are both case insensitive.
Example:
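import re

text = "A\nB"
print("Text: '%s'\n---" % text)

# Note: flags placed mid-pattern are only accepted by Python versions before 3.11;
# later versions raise re.error for them.
patterns = ["a", "a(?i)", "A.*B", "A(?s).*B", "A.*(?s)B"]

for p in patterns:
    match = re.search(p, text)
    print("Pattern: '%s' \tMatch: %s" % (p, match.span() if match else None))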
Output:
Text: 'A
B'
---
Pattern: 'a' Match: None
Pattern: 'a(?i)' Match: (0, 1)
Pattern: 'A.*B' Match: None
Pattern: 'A(?s).*B' Match: (0, 3)
Pattern: 'A.*(?s)B' Match: (0, 3)
ideone Demo
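How inline flags are handled depends on the interpreter version: scoped groups such as (?s:...) were added to re in Python 3.6, and flags placed mid-pattern are rejected from Python 3.11. The following is a minimal sketch, assuming only the standard library, for probing what the local interpreter accepts; accepts is a hypothetical helper name, not part of re.

```python
import re
import warnings


def accepts(pattern):
    """Return True if the local re module compiles `pattern` without an error."""
    with warnings.catch_warnings():
        # Some versions only warn about mid-pattern flags; silence that here.
        warnings.simplefilter("ignore")
        try:
            re.compile(pattern)
        except re.error:
            return False
    return True


scoped_ok = accepts(r"(?s:.*)")      # scoped flag group, as in the question
mid_flag_ok = accepts(r"A(?s).*B")   # global flag placed mid-pattern

print("scoped group accepted:", scoped_ok)
print("mid-pattern flag accepted:", mid_flag_ok)

if scoped_ok:
    # Inside the group the dot crosses the newline; the flag stays local to it.
    print(re.search(r"A(?s:.*)B", "A\nB").span())
```

If the scoped group is rejected, the character-class workarounds are the portable choice.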
(?s) (aka singleline or re.DOTALL) makes . also match newlines. And since you're trying to set it for only part of the pattern, there are 2 alternatives:
1. Set (?s) for the whole pattern (either passed as a flag or inline), and use [^\n]* instead of a dot to match any characters except newlines.
2. Use [\S\s]* instead of a dot to match any characters including newlines. The character class includes all whitespace and everything that is not whitespace (thus, all characters).
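The two alternatives can be compared side by side. This is a small sketch on made-up sample text (the text and variable names are illustrative assumptions), showing that both approaches capture the same groups:

```python
import re

text = "first line\nsecond line\nthird line"

# Alternative 1: DOTALL is on for the whole pattern, so [^\n]* stands in
# for a dot that has to stay on one line.
m1 = re.search(r"(?s)first([^\n]*)\n(.*)", text)

# Alternative 2: DOTALL is off, so [\s\S]* stands in for a dot that is
# allowed to cross newlines.
m2 = re.search(r"first(.*)\n([\s\S]*)", text)

print(m1.groups())
print(m2.groups())
```

Alternative 2 puts the flag at the start of the pattern nowhere at all, which is convenient when only the pattern string can be passed along.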
For the specific case you presented, you can use the following expression:
(?m)^DOCUMENTATION.*(\"{3}|'{3})\n-*\n?([\s\S]+?)^\1[\s\S]*
regex101 Demo
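As a quick check, the expression can be run against the sample file from the question. This is a sketch that rebuilds the sample as a string (the variable names are illustrative); the documentation body ends up in group 2:

```python
import re

# The sample file from the question, rebuilt line by line.
sample = "\n".join([
    "#!/usr/bin/python",
    "# -*- coding: utf-8 -*-",
    "# <GNU license here>",
    'DOCUMENTATION = """',
    "module: foo",
    "short_description: baz",
    "<some more here>",
    '"""',
    "<rest of the python code>",
])

# (?m) sits at the very start, so only the pattern string has to be passed.
pattern = r"(?m)^DOCUMENTATION.*(\"{3}|'{3})\n-*\n?([\s\S]+?)^\1[\s\S]*"

match = re.search(pattern, sample)
doc_body = match.group(2) if match else None
print(doc_body)
```

Group 1 holds the quote style that opened the block, which is why the closing delimiter is matched with the backreference \1.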
Note: This post covers inline modifiers in the re module, whereas Matthew Barnett's regex module does in fact implement inline modifiers (scoped flags) with the same behaviour observed in PCRE and Perl.