
How to use inline regex modifier in python [duplicate]

Tags: python, regex

I have a regex:

(.*\n)+DOCUMENTATION.*(\"\"\"|''')\n-*\n?((.*\n)+?)(\2)(?s:.*)

with which I'm trying to process files like this:

#!/usr/bin/python
# -*- coding: utf-8 -*-

# <GNU license here>

DOCUMENTATION = """
module: foo
short_description: baz
<some more here>    
"""

<rest of the python code>

I need to get the DOCUMENTATION part from it.

It works quite well, but not in Python. The problem is the inline modifier (?s:.*), which I used to capture the rest of the file (any character, including newlines, zero or more times). It seems to behave differently in Python.

Here is the example at regex101. It shows an error when I switch the flavor to Python.

NOTE: I can't set modifiers globally (I can only pass the regex to a Python module).

pawel7318 asked Feb 05 '15 21:02


1 Answer

Inline Modifiers in the re module

Python's re module implements inline (embedded) modifiers, such as (?s), (?i) or (?aiLmsux), but before Python 3.6 it did not accept them as part of a non-capturing group, which is how you were trying to use them.
(?smi:subpattern) works in Perl and PCRE; in Python's re module it is only accepted from version 3.6 onward.

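Moreover, an inline modifier placed anywhere in the pattern applies to the whole pattern and can't be turned off. (Since Python 3.6, placing one anywhere other than the start is deprecated, and from Python 3.11 it raises an error.)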

From regular-expressions.info:
In Python, putting a modifier in the middle of the regex affects the whole regex. So in Python, (?i)caseless and caseless(?i) are both case insensitive.


Example:

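import re

text = "A\nB"
print("Text: '%s'\n---" % text)
# The same flags placed at the start, in the middle, and at the end of a pattern.
patterns = ["a", "a(?i)", "A.*B", "A(?s).*B", "A.*(?s)B"]

for p in patterns:
    try:
        match = re.search(p, text)
    except re.error as e:
        # Python 3.11+ rejects global flags that are not at the start of the pattern.
        print("Pattern: '%s'    \tError: %s" % (p, e))
        continue
    print("Pattern: '%s'    \tMatch: %s" % (p, match.span() if match else None))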

Output (on Python versions before 3.11):

Text: 'A
B'
---
Pattern: 'a'            Match: None
Pattern: 'a(?i)'        Match: (0, 1)
Pattern: 'A.*B'         Match: None
Pattern: 'A(?s).*B'     Match: (0, 3)
Pattern: 'A.*(?s)B'     Match: (0, 3)

ideone Demo


Solution

(?s) (also known as single-line mode, or re.DOTALL) makes . match newlines as well. Since you're trying to apply it to only part of the pattern, there are two alternatives:

  1. Match anything except newlines:
    Set (?s) for the whole pattern (either passed as a flag or inline), and use [^\n]* instead of a dot wherever you want to match any characters except newlines.
  2. Match everything including newlines:
    Use [\S\s]* instead of a dot to match any characters including newlines. The character class contains all whitespace characters and all non-whitespace characters (thus, all characters).
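
As a rough sketch of the two alternatives side by side (the sample text and variable names are made up for illustration), both patterns below capture the rest of the first line in one group and everything after it in another:

```python
import re

# Hypothetical sample text, only for illustration.
text = "first line\nsecond line\nthird line"

# Alternative 1: DOTALL is on for the whole pattern, so '.' crosses newlines,
# and [^\n]* is used wherever the match should stay on a single line.
m1 = re.search(r"(?s)first([^\n]*)\n(.*)", text)
same_line, rest = m1.group(1), m1.group(2)

# Alternative 2: no DOTALL, so '.' stops at newlines, and [\s\S]* is used
# wherever the match should span several lines.
m2 = re.search(r"first(.*)\n([\s\S]*)", text)

print(repr(same_line))
print(repr(rest))
print(m1.groups() == m2.groups())
```

Either form avoids the need for a scoped modifier, so which one to pick mostly depends on whether the rest of the pattern relies on . stopping at newlines.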


For the specific case you presented, you can use the following expression:

(?m)^DOCUMENTATION.*(\"{3}|'{3})\n-*\n?([\s\S]+?)^\1[\s\S]*
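
A minimal sketch of how this expression could be applied, assuming a shortened version of the sample file from the question (the names PATTERN, source and documentation are illustrative only). The second capture group holds the DOCUMENTATION body:

```python
import re

# The expression above; group 1 is the quote style, group 2 the documentation body.
PATTERN = r"(?m)^DOCUMENTATION.*(\"{3}|'{3})\n-*\n?([\s\S]+?)^\1[\s\S]*"

# Shortened stand-in for the file shown in the question.
source = '''#!/usr/bin/python
# -*- coding: utf-8 -*-

DOCUMENTATION = """
module: foo
short_description: baz
"""

print("rest of the python code")
'''

match = re.search(PATTERN, source)
documentation = match.group(2) if match else None
print(documentation)
```

Because (?m) sits at the very start of the pattern, it can be passed as a plain string to a module that does not let you set flags.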

regex101 Demo


Note: This post covers inline modifiers in the re module, whereas Matthew Barnett's regex module does in fact implement inline modifiers (scoped flags) with the same behaviour observed in PCRE and Perl.
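
For readers on Python 3.6 or later, the standard re module also accepts the scoped form. The following is a small sketch under that assumption (the sample text is made up), showing that DOTALL only applies inside the group:

```python
import re

# Hypothetical sample text, only for illustration.
text = "A\nB\nC"

# (?s:...) switches DOTALL on only inside the group (re, Python 3.6+).
scoped = re.search(r"A(?s:.*)B.*", text)

# The '.' inside the group crosses the first newline; the trailing '.*'
# outside the group does not, so the match ends before the second newline.
print(scoped.span())
print(repr(scoped.group(0)))
```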

Mariano answered Oct 04 '22 15:10