Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Non-greedy regex quantifier gives greedy result

I have a .net regex which I am testing using Windows Powershell. The output is as follows:

> [System.Text.RegularExpressions.Regex]::Match("aaa aaa bbb", "aaa.*?bbb")


Groups   : {aaa aaa bbb}
Success  : True
Captures : {aaa aaa bbb}
Index    : 0
Length   : 11
Value    : aaa aaa bbb

My expectation was that using the ? quantifier would cause the match to be aaa bbb, as the second group of a's is sufficient to satisfy the expression. Is my understanding of non-greedy quantifiers flawed, or am I testing incorrectly?

Note: this is plainly not the same problem as Regular Expression nongreedy is greedy

like image 596
Dominic Cronin Avatar asked May 19 '13 09:05

Dominic Cronin


People also ask

How do you stop greedy in regex?

You make it non-greedy by using ". *?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ". *?" . This means that if for instance nothing comes after the ".

Is regex matching greedy?

The standard quantifiers in regular expressions are greedy, meaning they match as much as they can, only giving back as necessary to match the remainder of the regex. By using a lazy quantifier, the expression tries the minimal match first.

What does non-greedy mean in regex?

A non-greedy match means that the regex engine matches as few characters as possible—so that it still can match the pattern in the given string.

Why is regex greedy?

In general, the regex engine will try to match as many input characters as possible once it encounters a quantified token like \d+ or, in our case, . * . That behavior is called greedy matching because the engine will eagerly attempt to match anything it can.


2 Answers

This is a common misunderstanding. Lazy quantifiers do not guarantee the shortest possible match. They only make sure that the current quantifier, from the current position, does not match more characters than needed for an overall match.

If you truly want to ensure the shortest possible match, you need to make that explicit. In this case, this means that instead of .*?, you want a subregex that matches anything that is neither aaa nor bbb. The resulting regex will therefore be

aaa(?:(?!aaa|bbb).)*bbb
like image 181
Tim Pietzcker Avatar answered Nov 15 '22 00:11

Tim Pietzcker


Compare the result for the string aaa aaa bbb bbb:

regex: aaa.*?bbb 
result: aaa aaa bbb

regex: aaa.*bbb
result: aaa aaa bbb bbb

The regex engine finds first occurrence of aaa and then skips all characters (.*?) until first occurrence of bbb, but for the greedy operator (.*) it will go on to find a larger result and therefore match the last occurrence of bbb.

like image 42
j.holetzeck Avatar answered Nov 15 '22 01:11

j.holetzeck