I need help with regular expression matching using the non-greedy option.
The match pattern is:
<img\s.*>
The text to match is:
<html> <img src="test"> abc <img src="a" src='a' a=b> </html>
I tested this on http://regexpal.com
This expression matches all text from <img to the last >. I need it to match up to the first > encountered after the initial <img, so here I'd need to get two matches instead of the one that I get.
I tried all combinations of the non-greedy ?, with no success.
First, what it means to be greedy: a greedy quantifier takes as much as it can, then backs up until the rest of the pattern (for example a following 'ab') can match (this is called backtracking). To make the quantifier non-greedy you simply follow it with a '?'; it then takes as few characters as it can before the following 'ab' is matched.
It means the greedy quantifiers will match their preceding elements as many times as possible to return the longest match possible. Non-greedy quantifiers are the opposite: they match as few times as possible to return the shortest match possible.
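As a small illustration of that difference, here is a sketch using Python's re module. The input string "aXbXb" and the patterns are made up for this example and do not come from the question.

```python
import re

# Made-up sample input: there are two possible places to stop at a 'b'.
text = "aXbXb"

# Greedy: .* grabs as much as it can, then backtracks to the LAST 'b'.
greedy = re.search(r"a.*b", text).group()

# Non-greedy: .*? grabs as little as it can, so it stops at the FIRST 'b'.
lazy = re.search(r"a.*?b", text).group()

print(greedy)  # aXbXb
print(lazy)    # aXb
```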
Once the regex engine encounters the first .*, it'll match every character until the end of the input because the star quantifier is greedy. However, the token following the "anything" is a comma, which means that the regex engine has to backtrack until its current position is in front of a comma.
Regular expressions aren't greedy by default, but their quantifiers are :-) It seems to me the real question is, why are lazy quantifiers more poorly supported and/or awkward to use than greedy ones?
The non-greedy ? works perfectly fine. It's just that you need to select the "dot matches all" option in the regex engine you are testing with (regexpal, the engine you used, also has this option). This is because regex engines generally don't match line breaks with . unless you tell them explicitly that you want . to match line breaks too.
For example,
<img\s.*?>
works fine!
Check the results here.
Also, read about how dot behaves in various regex flavours.
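To make this concrete, here is a sketch in Python (not the JavaScript engine behind regexpal, so treat it as an analogy). It runs the question's two patterns on the question's text, then on a hypothetical tag split across several lines, where re.DOTALL plays the role of the "dot matches all" option.

```python
import re

# The text from the question, all on one line.
text = '<html> <img src="test"> abc <img src="a" src=\'a\' a=b> </html>'

# Greedy: one match running from the first <img to the last > in the text.
greedy_matches = re.findall(r"<img\s.*>", text)

# Non-greedy: each match stops at the first > after <img.
lazy_matches = re.findall(r"<img\s.*?>", text)

print(len(greedy_matches))  # 1
print(lazy_matches)         # ['<img src="test">', '<img src="a" src=\'a\' a=b>']

# Hypothetical input (not from the question): a tag spread over several lines.
multiline = '<img\n  src="test"\n  alt="x">'

# Without DOTALL, . will not cross a line break, so nothing matches.
without_dotall = re.findall(r"<img\s.*?>", multiline)

# With DOTALL, . also matches line breaks and the whole tag is found.
with_dotall = re.findall(r"<img\s.*?>", multiline, flags=re.DOTALL)

print(without_dotall)  # []
print(with_dotall == [multiline])  # True
```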
The ? modifier makes the match non-greedy. E.g. .* is greedy while .*? isn't. So you can use something like <img.*?> to match the whole tag. Or <img[^>]*>.
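A quick sketch of the negated character class variant in Python, using the text from the question. Because [^>] matches any character other than >, including a line break, this form should not need a "dot matches all" option; the multi-line tag below is a hypothetical input added to show that.

```python
import re

# The text from the question.
text = '<html> <img src="test"> abc <img src="a" src=\'a\' a=b> </html>'

# [^>]* can never run past a '>', so greediness is harmless here.
img_tag = re.compile(r"<img[^>]*>")

matches = img_tag.findall(text)
print(matches)  # ['<img src="test">', '<img src="a" src=\'a\' a=b>']

# Hypothetical multi-line tag: [^>] also matches '\n', so no flag is needed.
multiline_tag = '<img\n  src="test"\n  alt="x">'
multiline_matches = img_tag.findall(multiline_tag)
print(multiline_matches == [multiline_tag])  # True
```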
But remember that HTML as a whole can't actually be parsed with regular expressions.
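One case where the patterns above go wrong is a > inside a quoted attribute value. The sketch below uses a made-up snippet to show the regex stopping early, then collects the same tag with html.parser from Python's standard library. The class name ImgCollector is hypothetical, and this is only one possible approach, not something the answers prescribe.

```python
import re
from html.parser import HTMLParser

# Made-up input: the alt attribute contains a '>' character.
html_text = '<p><img alt="a > b" src="x.png"></p>'

# The regex stops at the '>' inside the attribute value.
regex_matches = re.findall(r"<img[^>]*>", html_text)
print(regex_matches)  # ['<img alt="a >']


class ImgCollector(HTMLParser):
    """Hypothetical helper that records the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "img":
            self.images.append(dict(attrs))


collector = ImgCollector()
collector.feed(html_text)
collector.close()

print(collector.images)  # [{'alt': 'a > b', 'src': 'x.png'}]
```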