How to implement a verbose REGEX in Python

Tags: python, regex

I am trying to use a verbose regular expression in Python (2.7). If it matters, my goal is just to make the expression easier to read and understand when I come back to it in the future. Because I am new to this, I first created a compact expression to make sure I was getting what I wanted.

Here is the compact expression:

test_verbose_item_pattern = re.compile('\n{1}\b?I[tT][eE][mM]\s+\d{1,2}\.?\(?[a-e]?\)?.*[^0-9]\n{1}')

It works as expected.

Here is the verbose expression:

verbose_item_pattern = re.compile("""
\n{1}       #begin with a new line allow only one new line character
\b?       #allow for a word boundary the ? allows 0 or 1 word boundaries \nITEM or \n  ITEM
I        # the first word on the line must begin with a capital I
[tT][eE][mM]  #then we need one character from each of the three sets this allows for unknown case
\s+       # one or more white spaces this does allow for another \n not sure if I should change it
\d{1,2}    # require one or two digits
\.?        # there could be 0 or 1 periods after the digits 1. or 1
\(?        # there might be 0 or 1 instance of an open paren
[a-e]?      # there could be 0 or 1 instance of a letter in the range a-e
\)?         # there could be 0 or 1 instance of a closing paren
.*          #any number of unknown characters so we can have words and punctuation
[^0-9]     # by its placement I am hoping that I am stating that I do not want to allow strings that end with a number and then \n
\n{1}     #I want to cut it off at the next newline character
""",re.VERBOSE)

The problem is that when I run the verbose pattern I get an exception

Traceback (most recent call last):
File "C:/Users/Dropbox/directEDGAR-Code-Examples/NewItemIdentifier.py", line 17, in <module>
 """,re.VERBOSE)
 File "C:\Python27\lib\re.py", line 190, in compile
  return _compile(pattern, flags)
 File "C:\Python27\lib\re.py", line 242, in _compile
 raise error, v # invalid expression
 error: nothing to repeat

I am afraid this is going to be something silly, but I can't figure it out. I took my verbose expression and compacted it line by line to make sure the compact version was the same as the verbose one.

The error message states there is nothing to repeat?

asked Dec 13 '12 02:12 by PyNEwbie


2 Answers

  • It is a good habit to use raw string literals when defining regex patterns. A lot of regex patterns use backslashes, and using a raw string literal will allow you to write single backslashes instead of having to worry about whether or not Python will interpret your backslash to have a different meaning (and having to use two backslashes in those cases).

  • \b? is not a valid pattern in Python's re module. It asks for 0-or-1 word boundaries, but a word boundary is a zero-width position: either you have one or you don't. If you have a word boundary, then you have 1 word boundary. If you don't, then you have 0 word boundaries. So \b? would (if it were accepted) always match, and the ? adds nothing.

  • Regex makes a distinction between the end of a string and the end of a line. (A string may consist of multiple lines.)

    • \A matches only the start of a string.
    • \Z matches only the end of a string.
    • $ matches the end of a string, and also end of a line in re.MULTILINE mode.
    • ^ matches the start of a string, and also start of a line in re.MULTILINE mode.
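
To make the difference between the string anchors and the line anchors concrete, here is a minimal sketch. The sample text and variable names are made up for illustration; only standard re module calls are used.

```python
import re

text = "first line\nsecond line"

# Default mode: ^ matches only at the very start of the string.
starts_default = re.findall(r"^\w+", text)

# re.MULTILINE: ^ also matches immediately after each newline.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# \A ignores re.MULTILINE and still matches only at the start of the string.
starts_string_only = re.findall(r"\A\w+", text, re.MULTILINE)

# re.MULTILINE: $ also matches just before each newline.
ends_multiline = re.findall(r"\w+$", text, re.MULTILINE)

# \Z ignores re.MULTILINE and matches only at the very end of the string.
ends_string_only = re.findall(r"\w+\Z", text, re.MULTILINE)

print(starts_default)      # ['first']
print(starts_multiline)    # ['first', 'second']
print(starts_string_only)  # ['first']
print(ends_multiline)      # ['line', 'line']
print(ends_string_only)    # ['line']
```

This is why the pattern for matching an "Item" line inside a longer document needs re.MULTILINE if it uses $ for the line ends.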

import re
verbose_item_pattern = re.compile(r"""
    $            # end of line boundary
    \s{1,2}      # 1-or-2 whitespace character, including the newline
    I            # a capital I
    [tT][eE][mM] # one character from each of the three sets this allows for unknown case
    \s+          # 1-or-more whitespaces INCLUDING newline
    \d{1,2}      # 1-or-2 digits
    [.]?         # 0-or-1 literal .
    \(?          # 0-or-1 literal open paren
    [a-e]?       # 0-or-1 letter in the range a-e
    \)?          # 0-or-1 closing paren
    .*           # any number of unknown characters so we can have words and punctuation
    [^0-9]       # anything but [0-9]
    $            # end of line boundary
    """, re.VERBOSE|re.MULTILINE)

x = verbose_item_pattern.search("""
 Item 1.0(a) foo bar
""")

print(x)

yields

<_sre.SRE_Match object at 0xb76dd020>

(indicating there is a match)
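
For what it's worth, here is a small sketch of what likely triggers "nothing to repeat" in the question's verbose pattern. It assumes the cause is the non-raw string literal: Python turns \n into a real newline before the regex engine sees it, re.VERBOSE then discards that newline as whitespace, and {1} is left with nothing in front of it. The helper name compiles is hypothetical, and the patterns are cut down to the first lines of the original.

```python
import re

def compiles(pattern, flags=0):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern, flags)
        return True
    except re.error:
        return False

# In a normal string literal, Python processes these escapes itself.
backspace_not_boundary = ("\b" == "\x08")  # \b becomes a backspace character
newline_is_one_char = (len("\n") == 1)     # \n becomes a real newline

# Compact, non-raw: the engine sees <newline>{1}<backspace>?I, which is legal.
compact_ok = compiles('\n{1}\b?I')

# Verbose, non-raw: the real newline is stripped as whitespace,
# so the pattern effectively starts with {1}.
verbose_nonraw_ok = compiles("""
\n{1}
I
""", re.VERBOSE)

# Verbose, raw: the engine sees the two characters \ and n, so {1} has a target.
verbose_raw_ok = compiles(r"""
\n{1}
I
""", re.VERBOSE)

# Once the string is raw, \b really is a word boundary, and \b? is rejected.
optional_boundary_ok = compiles(r"\b?I")

print(compact_ok, verbose_nonraw_ok, verbose_raw_ok, optional_boundary_ok)
```

Under these assumptions the compact pattern compiles only because \b is silently treated as a backspace, which also means it was never testing for a word boundary.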

answered Sep 29 '22 11:09 by unutbu


As said in the comments, you should escape your backslashes or use a raw string, even with triple quotes.

verbose_item_pattern = re.compile(r"""
...
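
As a quick illustration of the two options, doubling the backslashes in a normal string gives the regex engine the same characters as a raw string does. The pattern fragment and sample text below are made up for the example and are not the full pattern from the question.

```python
import re

# Doubled backslashes in a normal string and single backslashes in a
# raw string produce exactly the same characters.
escaped = "\\n{1} \\d{1,2} \\.?"
raw = r"\n{1} \d{1,2} \.?"
same_text = (escaped == raw)

# With re.VERBOSE the spaces between the pieces are ignored.
pattern = re.compile(raw, re.VERBOSE)
match = pattern.search("Item\n12.")

print(same_text)            # True
print(repr(match.group()))  # '\n12.'
```

The raw string is usually the easier of the two to read, since the pattern looks the same in the source as it does to the regex engine.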
answered Sep 29 '22 10:09 by Ghislain Hivon