
My regex expression is both lazy and greedy. Why?

Tags:

regex

Suppose I'm searching for anchor links in a web page. A regex that works is:

 "\<a\s+.*?\>"

However, let's add a complication. Suppose I only want links that surround specific text, for instance the word 'next'. Normally, I would think all I had to do is:

 "\<a\s+.*?\>next"

But I find that if there are 3 anchor tags in a page, and the third one has 'next' after it, the regex search finds a huge string that starts at the first anchor tag and extends through the third. This makes sense if the ".*?" is matching all characters until it comes across ">next". But that is not what I want. I want it to match characters only until it comes across ">", with the additional constraint that right after the ">" there should be "next".

How do I get this to work?

Gordon Dugan asked Apr 03 '16 11:04



1 Answer

You can fix your regex by prohibiting it from matching > inside the tag, i.e. by replacing . with [^>]:

"\<a\s+[^>]*?\>next"

.*? matches any number of characters. The fact that you made it reluctant does not make it stop at >: it continues matching past it, until it finds >next at the end. This is not greedy, because the expression matched as little as possible to obtain a match. It's just that no shorter matches were available.
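The difference can be checked with a short Python sketch. The sample HTML string and the variable names are made up for illustration; the patterns are the ones from the question and from this answer, written without the backslashes before < and >, which Python's re module does not need.

```python
import re

# Hypothetical sample page with three anchor tags; only the third wraps "next".
html = '<a href="/1">first</a> <a href="/2">second</a> <a href="/3">next</a>'

# Pattern from the question: the lazy dot may still cross ">" characters
# when that is the only way to reach ">next".
lazy_dot = re.compile(r'<a\s+.*?>next')

# Pattern from the answer: [^>] cannot cross ">", so the match has to stay
# inside a single opening tag.
negated_class = re.compile(r'<a\s+[^>]*?>next')

lazy_match = lazy_dot.search(html).group(0)
fixed_match = negated_class.search(html).group(0)

print(lazy_match)   # starts at the first <a and runs through the third tag
print(fixed_match)  # <a href="/3">next
```

The search starts at the leftmost position where a match is possible, which is the first anchor tag; laziness only shortens the match from that starting point, it does not move the start. With [^>] in place, the ? after the * no longer changes the result, since the class already stops at the first >.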


Sergey Kalinichenko answered Sep 29 '22 07:09