Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Greedy vs. Reluctant vs. Possessive Qualifiers

I found this tutorial on regular expressions and while I intuitively understand what "greedy", "reluctant" and "possessive" qualifiers do, there seems to be a serious hole in my understanding.

Specifically, in the following example:

Enter your regex: .*foo // Greedy qualifier Enter input string to search: xfooxxxxxxfoo I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.  Enter your regex: .*?foo // Reluctant qualifier Enter input string to search: xfooxxxxxxfoo I found the text "xfoo" starting at index 0 and ending at index 4. I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.  Enter your regex: .*+foo // Possessive qualifier Enter input string to search: xfooxxxxxxfoo No match found. 

The explanation mentions eating the entire input string, letters been consumed, matcher backing off, rightmost occurrence of "foo" has been regurgitated, etc.

Unfortunately, despite the nice metaphors, I still don't understand what is eaten by whom... Do you know of another tutorial that explains (concisely) how regular expression engines work?

Alternatively, if someone can explain in somewhat different phrasing the following paragraph, that would be much appreciated:

The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f", "o", "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f", "o", "o") have already been consumed [by whom?]. So the matcher slowly backs off [from right-to-left?] one letter at a time until the rightmost occurrence of "foo" has been regurgitated [what does this mean?], at which point the match succeeds and the search ends.

The second example, however, is reluctant, so it starts by first consuming [by whom?] "nothing". Because "foo" doesn't appear at the beginning of the string, it's forced to swallow [who swallows?] the first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13.

The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+ [how?], leaving nothing left over to satisfy the "foo" at the end of the expression. Use a possessive quantifier for situations where you want to seize all of something without ever backing off [what does back off mean?]; it will outperform the equivalent greedy quantifier in cases where the match is not immediately found.

like image 637
Regex Rookie Avatar asked Mar 16 '11 00:03

Regex Rookie


People also ask

What are reluctant quantifiers?

A reluctant quantifier indicates the search engine to start with the shortest possible piece of the string. Once match found, the engine continue; otherwise it adds one character to the section of the string being checked and search that, and so on.

What are greedy quantifiers?

Greedy quantifiers − Greedy quantifiers are the default quantifiers. A greedy quantifier matches as much as possible from the input string (longest match possible) if match not occurred it leaves the last character and matches again.

What is a possessive quantifier?

A possessive quantifier is similar to greedy quantifier. It indicates the engine to start by checking the entire string.It is different in the sense if it doesn't work, if match failed and there is no looking back. Following are various examples of Possessive Quantifiers using regular expression in java.

What is possessive quantifier regex?

An quantifier in a regular expression may be greedy (the default), reluctant, or possesive. A possesive quantifier does this: The match starts with the first unmatched character in the string. The possessive quantifier starts matching from left to right one character at a time.


2 Answers

I'll give it a shot.

A greedy quantifier first matches as much as possible. So the .* matches the entire string. Then the matcher tries to match the f following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less character (leaving the "o" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step, making the greedy quantifier match one less character again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f in the regex, and the o and the next o are matched too. Success!

A reluctant or "non-greedy" quantifier first matches as little as possible. So the .* matches nothing at first, leaving the entire string unmatched. Then the matcher tries to match the f following, but the unmatched portion of the string starts with "x" so that doesn't work. So the matcher backtracks, making the non-greedy quantifier match one more character (now it matches the "x", leaving "fooxxxxxxfoo" unmatched). Then it tries to match the f, which succeeds, and the o and the next o in the regex match too. Success!

In your example, it then starts the process over with the remaining unmatched portion of the string, "xxxxxxfoo", following the same process.

A possessive quantifier is just like the greedy quantifier, but it doesn't backtrack. So it starts out with .* matching the entire string, leaving nothing unmatched. Then there is nothing left for it to match with the f in the regex. Since the possessive quantifier doesn't backtrack, the match fails there.

like image 108
Anomie Avatar answered Oct 03 '22 13:10

Anomie


It is just my practice output to visualise the scene-

Visual Image

like image 42
SIslam Avatar answered Oct 03 '22 14:10

SIslam