
Python regex with look behind and alternatives

Tags: python, regex

I want a regular expression that finds the text "wrapped" between "HEAD" or "HEADa" and the next "HEAD". That is, the text may start with either HEAD or HEADa as its first word, and every following "head" is of type HEAD.

  1. HEAD\n\n text...text...HEAD \n\n text....text HEAD\n\n text....text .....
  2. HEADa\n\n text...text...HEAD \n\n text....text HEAD\n\n text....text .....

I only want to capture the text between the "heads", so I have a regex with lookbehind and lookahead expressions that look for my "heads". I have the following regex:

var = "HEADa", "HEAD"

my_pat = re.compile(r"(?<=^\b"+var[0]+r"|"+var[1]+r"\b) \w*\s\s(.*?)(?=\b"+var[1] +r"\b)",re.DOTALL|re.MULTILINE)

However, when I try to compile this regex, I get an error message saying that a lookbehind expression cannot have variable length. What is wrong with this regex?

andreSmol asked Nov 19 '11 13:11




1 Answer

Currently, the first part of your regex looks like this:

(?<=^\bHEADa|HEAD\b)

You have two alternatives; one matches five characters and the other matches four, and that's why you get the error: Python's re module requires a lookbehind to match a fixed number of characters. Some regex flavors will let you do that even though they say they don't allow variable-length lookbehinds, but not Python. You could break it up into two lookbehinds, like this:

(?:(?<=^HEADa\b)|(?<=\bHEAD\b))
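As a quick check of the difference, here is a minimal sketch that compiles both forms. The sample string and the variable names are made up for illustration; only the two patterns come from the discussion above.

```python
import re

# One lookbehind with a 5-character branch ("HEADa") and a 4-character
# branch ("HEAD"): Python's re module refuses to compile it.
try:
    re.compile(r"(?<=^\bHEADa|HEAD\b)")
    single_lookbehind_error = None
except re.error as exc:
    single_lookbehind_error = str(exc)

# Two separate lookbehinds, each with a single fixed width, compile fine.
split_lookbehinds = re.compile(r"(?:(?<=^HEADa\b)|(?<=\bHEAD\b))", re.MULTILINE)

# Hypothetical sample text, not from the question.
sample = "HEADa\n\nfirst block HEAD\n\nsecond block"

# The pattern is zero-width, so each match is just the position right
# after a delimiter.
positions = [m.start() for m in split_lookbehinds.finditer(sample)]

print(single_lookbehind_error)
print(positions)
```

The anchors `^` and `\b` consume no characters, so they don't affect the width of either lookbehind.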

...but you probably don't need lookbehinds for this anyway. Try this instead:

(?:^HEADa|\bHEAD)\b

Whatever gets matched by the (.*?) later on will still be available through group #1. If you really need the whole of the text between the delimiters, you can capture that in group #1, and that other group will become #2 (or you can use named groups, and not have to keep track of the numbers).
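To make that concrete, here is a rough sketch of the straight-match approach with a capturing group. The function name `extract_blocks`, the sample text, the `\s*` padding, and the `|\Z` alternative in the lookahead (so the last block is captured too) are my own assumptions, not something taken from the question.

```python
import re

# Match the delimiter directly instead of asserting it with a lookbehind;
# the text of interest lands in group 1. The lookahead leaves the next
# HEAD unconsumed so it can open the following match.
# Assumption: "\Z" is added so the final block (with no HEAD after it)
# is captured as well.
BLOCK_PATTERN = re.compile(
    r"(?:^HEADa|\bHEAD)\b\s*(.*?)\s*(?=\bHEAD\b|\Z)",
    re.DOTALL | re.MULTILINE,
)


def extract_blocks(text):
    """Return the text found between consecutive delimiters (hypothetical helper)."""
    return BLOCK_PATTERN.findall(text)


# Hypothetical sample input shaped like the examples in the question.
sample = "HEADa\n\nfirst text HEAD\n\nsecond text HEAD\n\nthird text"
blocks = extract_blocks(sample)
print(blocks)
```

With a single capturing group, `findall` returns just the group contents, so there is no need to strip the delimiter off afterwards.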

Generally speaking, lookbehind should never be your first resort. It may seem like the obvious tool for the job, but you're usually better off doing a straight match and extracting the part you want with a capturing group. And that's true of all flavors, not just Python; just because you can do more with lookbehinds in other flavors doesn't mean you should.

BTW, you may have noticed that I redistributed your word boundaries; I think this is what you really intended.

Alan Moore answered Sep 22 '22 15:09