Suppose I'm searching for anchor links in a web page. A regex that works is:
"\<a\s+.*?\>"
However, let's add a complication. Let's suppose that I only want links which surround specific text, for instance, the word 'next'. Normally, I would think all I had to do is:
"\<a\s+.*?\>next"
But I find that if there are 3 anchor tags in a page, and the third one has 'next' after it, the regex search finds a huge string starting at the first anchor tag and extending to the third anchor tag. This makes sense if the period-asterisk-question-mark is finding all characters until it comes across ">next". But that is not what I want. I want to find all characters until it comes across ">", and then an additional constraint should be that right after the ">" there should be "next".
How do I get this to work?
The default behavior of regular expression quantifiers is to be greedy. That means a quantifier takes as much text as it can while still letting the overall pattern match, even when a smaller part would have been syntactically sufficient. With a greedy ".*", instead of matching up to the first occurrence of '>', it runs on to the last one.
You make it non-greedy by using ".*?". With this construct, at every step where it adds a character to the ".", the regex engine first attempts to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", it matches nothing.
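To make the lazy behavior concrete, here is a minimal sketch in Python's re module. The sample string is made up for illustration, and other regex engines may differ in details.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "<a href='x'>one</a>"

# Nothing follows the lazy quantifier, so the shortest possible match is empty.
empty = re.match(r".*?", text).group(0)

# With ">" after it, the lazy quantifier grows one character at a time
# until ">" can match, so it stops at the first ">".
up_to_gt = re.match(r".*?>", text).group(0)

# The greedy form runs to the end of the string first, then backtracks
# to the last ">".
greedy = re.match(r".*>", text).group(0)

print(repr(empty))  # ''
print(up_to_gt)     # <a href='x'>
print(greedy)       # <a href='x'>one</a>
```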
Greedy. By default the regular expression engine tries to repeat the quantified character as many times as possible. For instance, \d+ consumes all possible digits. When it becomes impossible to consume more (no more digits or string end), then it continues to match the rest of the pattern.
A greedy match means that the regex engine (the one which tries to find your pattern in the string) matches as many characters as possible. For example, the regex 'a+' will match as many 'a's as possible in your string 'aaaa'.
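The two examples above ('\d+' and 'a+') can be checked with a short sketch, again assuming Python's re module. The input strings are invented for the demonstration.

```python
import re

# \d+ is greedy: it takes every digit it can before the rest of the
# pattern is tried.
digits = re.search(r"\d+", "order 12345 shipped").group(0)

# a+ on 'aaaa' consumes all four characters; the lazy a+? stops after one.
greedy_a = re.search(r"a+", "aaaa").group(0)
lazy_a = re.search(r"a+?", "aaaa").group(0)

# A greedy quantifier still gives characters back when the rest of the
# pattern needs them: here the trailing \d forces \d+ to release one digit.
backtracked = re.search(r"(\d+)(\d)", "12345").groups()

print(digits)       # 12345
print(greedy_a)     # aaaa
print(lazy_a)       # a
print(backtracked)  # ('1234', '5')
```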
You can fix your regex by prohibiting it from matching ">" inside the tag, i.e. by replacing "." with "[^>]":
"\<a\s+[^>]*?\>next"
".*?" matches any number of characters. The fact that you made it reluctant does not make it stop at ">": it continues matching past it, until it finds ">next" at the end. This is not greedy, because the expression matched as little as possible to obtain a match. It's just that no shorter matches were available.
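As a rough sketch of the difference, the snippet below runs both patterns over a made-up page fragment with three anchor tags. It assumes Python's re module, so the backslashes before "<" and ">" are dropped (they are not needed there), and the HTML string and variable names are hypothetical.

```python
import re

# Hypothetical page fragment: three anchors, only the third surrounds 'next'.
html = (
    '<a href="1.html">first</a> '
    '<a href="2.html">second</a> '
    '<a href="3.html">next</a>'
)

lazy_dot = r"<a\s+.*?>next"    # original attempt: "." may cross ">"
negated = r"<a\s+[^>]*?>next"  # fix: stay inside a single tag

m1 = re.search(lazy_dot, html).group(0)
m2 = re.search(negated, html).group(0)

# The lazy dot starts at the first "<a" and keeps expanding until ">next" appears.
print(m1)
# <a href="1.html">first</a> <a href="2.html">second</a> <a href="3.html">next

# The negated class cannot pass a ">", so the attempts starting at the first
# two anchors fail and the match begins at the third.
print(m2)
# <a href="3.html">next
```

Because "[^>]" can never run past the end of the tag, the lazy "?" no longer changes which text is matched here; "[^>]*" would find the same result.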