
Python re.sub: non-greedy mode (.*?) combined with end of string ($) behaves greedily!

Code:

import re

s = '<br><br />A<br />B'  # renamed from str to avoid shadowing the built-in
print(re.sub(r'<br.*?>\w$', '', s))

It is expected to return <br><br />A, but it returns an empty string ''!

Any suggestion?

Jet Guo asked Nov 25 '10 05:11



1 Answer

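Laziness only works from left to right: the engine still takes the leftmost position where a match can start, and the lazy quantifier only affects how far the match extends from there. A lazy quantifier basically means "don't consume another character unless the rest of the pattern failed to match". Here's what's going on: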

  1. The regex engine matches <br at the start of the string.
  2. .*? matches nothing for now, because it is lazy.
  3. The engine tries to match >, and succeeds.
  4. It tries to match \w and fails, because the next character is <. Now it gets interesting: the engine backtracks to the .*? rule. Here . can match the first >, so there is still hope for a match.
  5. This keeps happening until the engine reaches the > that closes the first <br />. There >\w can match (\w matches A), but $ fails. Again, the engine goes back to the lazy .*? rule and keeps extending it, until the whole pattern matches <br><br />A<br />B, which is the entire string.
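The steps above can be checked with re.search, which reports where the match starts and ends. This is only a minimal sketch using the string from the question; the variable names are illustrative.

```python
import re

s = '<br><br />A<br />B'

# The lazy pattern from the question. The match starts at the leftmost
# '<br', and .*? keeps expanding until '>\w$' finally succeeds.
m = re.search(r'<br.*?>\w$', s)
lazy_span = m.span()
lazy_text = m.group(0)

print(lazy_span)   # start and end offsets of the match
print(lazy_text)   # the matched text
```

The span covers the whole string, which is why re.sub leaves nothing behind.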

Luckily, there's an easy solution: by using the pattern <br[^>]*>\w$ instead, you don't allow the match to extend past the end of a tag, so only the last occurrence is replaced.
Strictly speaking, this doesn't work well for HTML, because tag attributes can contain > characters, but I assume it's just an example.
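As a quick check of the character-class version, here is a small sketch on the string from the question (variable names are illustrative, and the caveat about > inside attributes still applies):

```python
import re

s = '<br><br />A<br />B'

# [^>]* cannot cross a '>', so a match that starts at an earlier '<br'
# can never stretch over the following tags to reach the end of the string.
fixed = re.sub(r'<br[^>]*>\w$', '', s)

print(fixed)  # <br><br />A
```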

Kobi answered Oct 20 '22 01:10