Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex: Is Lazy Worse?

I have always written regexes like this

<A HREF="([^"]*)" TARGET="_blank">([^<]*)</A>

but I just learned about this lazy thing and that I can write it like this

<A HREF="(.*?)" TARGET="_blank">(.*?)</A>

is there any disadvantage to using this second approach? The regex is definitely more compact (even SO parses it better).

Edit: There are two best answers here, which point out two important differences between the expressions. ysth's answer points to a weakness in the non-greedy/lazy one, in which the hyperlink itself could possibly include other attributes of the A tag (definitely not good). Rob Kennedy points out a weakness in the greedy example, in that anchor texts cannot include other tags (definitely not okay, because it wouldn't grab all the anchor text either)... so the answer is that, regular expressions being what they are, lazy and non-lazy solutions that seem the same are probably not semantically equivalent.

Edit: Third best answer is by Alan M about relative speed of the expressions. For the time being, I'll mark his as best answer so people give him more points :)

like image 976
Dan Rosenstark Avatar asked Dec 14 '08 18:12

Dan Rosenstark


People also ask

What does lazy mean in regex?

'Lazy' means match shortest possible string. For example, the greedy h. +l matches 'hell' in 'hello' but the lazy h. +? l matches 'hel' .

What does non-greedy mean regex?

A non-greedy match means that the regex engine matches as few characters as possible—so that it still can match the pattern in the given string.

What is a lazy quantifier?

With a lazy quantifier, the engine starts out by matching as few of the tokens as the quantifier allows. For instance, with A*, the engine starts out matching zero characters, since *allows the engine to match "zero or more".


2 Answers

Another thing to consider is how long the target text is, and how much of it is going to be matched by the quantified subexpression. For example, if you were trying to match the whole <BODY> element in a large HTML document, you might be tempted to use this regex:

/<BODY>.*?<\/BODY>/is

But that's going to do a whole lot of unnecessary work, matching one character at a time while effectively doing a negative lookahead before each one. You know the </BODY> tag is going to be very near the end of the document, so the smart thing to do is to use a normal greedy quantitier; let it slurp up the whole rest of the document and then backtrack the few characters necessary to match the end tag.

In most cases you won't notice any speed difference between greedy and reluctant quantifiers, but it's something to keep in mind. The main reason why you should be judicious in your use of reluctant quantifiers is the one that was pointed out by the others: they may do it reluctantly, but they will match more than you want them to if that's what it takes to achieve an overall match.

like image 177
Alan Moore Avatar answered Oct 29 '22 20:10

Alan Moore


The complemented character class more rigorously defines what you want to match, so whenever you can, I'd use it.

The non greedy regex will match things you probably don't want, such as:

<A HREF="foo" NAME="foo" TARGET="_blank">foo</A>

where your first .*? matches

foo" NAME="foo
like image 40
ysth Avatar answered Oct 29 '22 19:10

ysth