Python regex with look behind and alternatives

Q: Does Python support negative Lookbehind?

Yes. In a pattern such as (?<!\$)\d+, the (?<!\$) is a negative lookbehind: it asserts that the match is not immediately preceded by a $ sign. The \d+ matches a number with one or more digits.
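
A minimal sketch of that negative lookbehind in Python's re module. The sample string and the extra \b in the second pattern are assumptions added for illustration; they are not part of the original snippet.

```python
import re

sample = "cost $100 for 42 items"  # hypothetical sample text

# Without a word boundary the engine may start inside "100":
# the "0" after "1" is not preceded by "$", so "00" still matches.
loose = re.findall(r"(?<!\$)\d+", sample)

# Adding \b forces each match to start at the beginning of a number.
strict = re.findall(r"(?<!\$)\b\d+", sample)

print(loose)   # ['00', '42']
print(strict)  # ['42']
```

The lookbehind only tests the character right before the match start, which is why the word boundary is needed to reject the price as a whole.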

Q: What is a positive Lookbehind in regex?

In a positive lookbehind, the regex engine checks that a given element (a character, several characters, or a group) appears immediately before the item being matched. If that element is found just before the match, the match succeeds; otherwise it fails at that position.
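
A short sketch of a positive lookbehind, assuming a made-up sample string: only digits that directly follow a $ sign are returned, and the $ itself is not part of the match.

```python
import re

sample = "cost $100 for 42 items"  # hypothetical sample text

# (?<=\$) requires a "$" immediately before the digits but does not consume it.
prices = re.findall(r"(?<=\$)\d+", sample)

print(prices)  # ['100']
```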

Tags: python, regex

I want a regular expression that finds the text that is "wrapped" between "HEAD" or "HEADa" and "HEAD". That is, I may have a text whose first word is HEAD or HEADa, and the following "heads" are all of type HEAD.

HEAD\n\n text...text...HEAD \n\n text....text HEAD\n\n text....text .....
HEADa\n\n text...text...HEAD \n\n text....text HEAD\n\n text....text .....

I only want to capture the text between the "heads", so I have a regex with lookbehind and lookahead expressions that look for my "heads". I have the following regex:

var = "HEADa", "HEAD"

my_pat = re.compile(r"(?<=^\b"+var[0]+r"|"+var[1]+r"\b) \w*\s\s(.*?)(?=\b"+var[1] +r"\b)",re.DOTALL|re.MULTILINE)

However, when I try to execute this regex, I am getting an error message saying that I cannot have variable length in the look behind expression. What is wrong with this regex?

asked Nov 19 '11 13:11

andreSmol

1 Answer

Currently, the first part of your regex looks like this:

(?<=^\bHEADa|HEAD\b)

You have two alternatives; one matches five characters and the other matches four, and that's why you get the error. Some regex flavors will let you do that even though they say they don't allow variable-length lookbehinds, but not Python. You could break it up into two lookbehinds, like this:
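
A small sketch that reproduces the problem with the standard re module. The second pattern, with the made-up alternative HEADb, is only there to show that alternatives of equal length are accepted.

```python
import re

# Alternatives of 5 and 4 characters: Python's re refuses to compile this.
try:
    re.compile(r"(?<=^\bHEADa|HEAD\b)")
    message = ""
except re.error as exc:
    message = str(exc)

print(message)  # mentions that look-behind requires a fixed-width pattern

# Alternatives of the same length (HEADb is hypothetical) compile fine.
same_width = re.compile(r"(?<=HEADa|HEADb)")
```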

(?:(?<=^HEADa\b)|(?<=\bHEAD\b))
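
A sketch of how the two-lookbehind form could be used end to end. The sample text, the \s* trimming around the group, and the |\Z in the lookahead (so the last section is also returned) are assumptions for illustration, not part of the original pattern.

```python
import re

# Hypothetical sample shaped like the text in the question.
text = "HEADa\n\nfirst body HEAD\n\nsecond body HEAD\n\nthird body"

pattern = re.compile(
    r"(?:(?<=^HEADa\b)|(?<=\bHEAD\b))"  # two fixed-width lookbehinds
    r"\s*(.*?)\s*"                      # lazily captured body, whitespace trimmed
    r"(?=\bHEAD\b|\Z)",                 # stop before the next HEAD or at the end
    re.DOTALL,
)

bodies = pattern.findall(text)
print(bodies)  # ['first body', 'second body', 'third body']
```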

...but you probably don't need lookbehinds for this anyway. Try this instead:

(?:^HEADa|\bHEAD)\b

Whatever gets matched by the (.*?) later on will still be available through group #1. If you really need the whole of the text between the delimiters, you can capture that in group #1, and that other group will become #2 (or you can use named groups, and not have to keep track of the numbers).
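
A sketch of the no-lookbehind approach with a named group, so there is no group number to keep track of. The sample text, the group name body, and the |\Z in the lookahead are assumptions made for this example.

```python
import re

# Hypothetical sample shaped like the text in the question.
text = "HEADa\n\nfirst body HEAD\n\nsecond body HEAD\n\nthird body"

pattern = re.compile(
    r"(?:^HEADa|\bHEAD)\b"    # consume the delimiter instead of looking behind
    r"\s*(?P<body>.*?)\s*"    # named group holding the text between heads
    r"(?=\bHEAD\b|\Z)",       # stop before the next HEAD or at the end
    re.DOTALL,
)

bodies = [m.group("body") for m in pattern.finditer(text)]
print(bodies)  # ['first body', 'second body', 'third body']
```

Because the delimiter is consumed by the match, the full match includes the head itself; only the named group holds the wanted text.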

Generally speaking, lookbehind should never be your first resort. It may seem like the obvious tool for the job, but you're usually better off doing a straight match and extracting the part you want with a capturing group. And that's true of all flavors, not just Python; just because you can do more with lookbehinds in other flavors doesn't mean you should.

BTW, you may have noticed that I redistributed your word boundaries; I think this is what you really intended.

answered Sep 22 '22 15:09

Alan Moore
