
How to split long regular expression rules to multiple lines in Python

Tags:

python

regex

Is this actually doable? I have some very long regex patterns that are hard to understand because they don't fit on the screen at once. Example:

test = re.compile('(?P<full_path>.+):\d+:\s+warning:\s+Member\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) of (class|group|namespace)\s+(?P<class_name>.+)\s+is not documented' % (self.__MEMBER_TYPES), re.IGNORECASE) 

Backslash or triple quotes won't work.

EDIT: I ended up using VERBOSE mode. Here's how the regexp pattern looks now:

test = re.compile(r'''
  (?P<full_path>                                  # Capture a group called full_path
    .+                                            #   It consists of one or more characters of any type
  )                                               # Group ends
  :                                               # A literal colon
  \d+                                             # One or more digits (line number)
  :                                               # A literal colon
  \s+warning:\s+parameters\sof\smember\s+         # An almost static string
  (?P<member_name>                                # Capture a group called member_name
    [^:]+                                         #   One or more characters other than a colon (so finding a colon ends the group)
  )                                               # Group ends
  (                                               # Start an unnamed group
    ::                                            #   Two literal colons
    (?P<function_name>                            #   Start another group called function_name
      \w+                                         #     It consists of one or more alphanumeric characters
    )                                             #   End group
  )*                                              # This group is entirely optional and does not apply to C
  \s+are\snot\s\(all\)\sdocumented''',            # And the line ends with an almost static string
  re.IGNORECASE|re.VERBOSE)                       # Let's not worry about case, because it seems to differ between Doxygen versions
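A minimal sketch of how a VERBOSE pattern like this can be exercised. The pattern is a trimmed-down version of the one above, and the sample warning line is invented for illustration, not real Doxygen output. One thing to keep in mind: whitespace and # are taken literally inside a character class even in VERBOSE mode, so a class such as [^:] is best kept on a single line.

```python
import re

# Trimmed-down version of the pattern, still written in VERBOSE mode.
WARNING_RE = re.compile(r'''
    (?P<full_path>.+)            # path of the file that triggered the warning
    :\d+:                        # line number between two literal colons
    \s+warning:\s+parameters\sof\smember\s+
    (?P<member_name>[^:]+)       # character class stays on one line
    (::(?P<function_name>\w+))*  # optional ::function part
    \s+are\snot\s\(all\)\sdocumented
    ''', re.IGNORECASE | re.VERBOSE)

# Hypothetical sample line, invented for this example.
sample = "src/foo.h:42: warning: parameters of member Foo::bar are not (all) documented"

match = WARNING_RE.match(sample)
result = match.groupdict() if match else None
print(result)
```

Named groups keep the calling code readable: match.groupdict() returns the captured pieces keyed by group name, so the long pattern never has to be indexed by position.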
Makis asked Nov 04 '11 08:11



2 Answers

You can split your regex pattern by quoting each segment. No backslashes needed.

test = re.compile(('(?P<full_path>.+):\d+:\s+warning:\s+Member'
                   '\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) '
                   'of (class|group|namespace)\s+(?P<class_name>.+)'
                   '\s+is not documented') % (self.__MEMBER_TYPES), re.IGNORECASE)

You can also use the raw string prefix r, but you'll have to put it before each segment.

See the docs.
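A small sketch of the raw-string variant. MEMBER_TYPES here is a hypothetical stand-in for self.__MEMBER_TYPES from the question, and the sample line is made up:

```python
import re

# Hypothetical stand-in for self.__MEMBER_TYPES from the question.
MEMBER_TYPES = 'function|variable|typedef'

# Each segment carries its own r prefix; the adjacent literals are joined
# first, then % is applied to the parenthesized result.
pattern = re.compile(
    (r'(?P<full_path>.+):\d+:\s+warning:\s+Member'
     r'\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) '
     r'of (class|group|namespace)\s+(?P<class_name>.+)'
     r'\s+is not documented') % MEMBER_TYPES,
    re.IGNORECASE)

# Invented sample line, for illustration only.
sample = 'src/foo.h:10: warning: Member bar (function) of class Foo is not documented'
match = pattern.match(sample)
groups = match.groupdict() if match else None
print(groups)
```

The extra parentheses around the segments matter: without them, % would bind only to the last literal instead of the whole concatenated pattern.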

naeg answered Sep 22 '22 06:09


From http://docs.python.org/reference/lexical_analysis.html#string-literal-concatenation:

Multiple adjacent string literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld". This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings, for example:

re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )

Note that this feature is defined at the syntactical level, but implemented at compile time. The ‘+’ operator must be used to concatenate string expressions at run time. Also note that literal concatenation can use different quoting styles for each component (even mixing raw strings and triple quoted strings).
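To make the distinction in the quoted passage concrete, here is a short sketch (variable names are made up for illustration) contrasting adjacent literals, which are merged at compile time, with expressions, which need an explicit + at run time:

```python
import re

# Adjacent literals are merged by the compiler, so comments can sit between them.
identifier = re.compile(
    r"[A-Za-z_]"      # first character: letter or underscore
    r"[A-Za-z0-9_]*"  # rest: letters, digits or underscores
)

# Mixing quoting styles works too: raw literals next to a triple-quoted one.
mixed = r"\d+" """-""" r'\d+'

# A name is an expression, not a literal, so it needs an explicit '+'.
prefix = r"\d+"
runtime = prefix + "-" + r"\d+"

print(identifier.fullmatch("_name1") is not None)
print(mixed == runtime)
```

Writing prefix "-" without the + would be a syntax error, because implicit concatenation only applies to literals sitting next to each other in the source.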

N3dst4 answered Sep 23 '22 06:09