Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding Quantifiers

I was going through the Java Tutorial on Quantifiers.

There is a difference mentioned among Differences Among Greedy, Reluctant, and Possessive Quantifiers.

I am not able to understand what's the difference exactly.

Explanation provided as follows:

Enter your regex: .*foo  // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo  // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.

The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f" "o" "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f" "o" "o") have already been consumed. So the matcher slowly backs off one letter at a time until the rightmost occurrence of "foo" has been regurgitated, at which point the match succeeds and the search ends.

The second example, however, is reluctant, so it starts by first consuming "nothing". Because "foo" doesn't appear at the beginning of the string, it's forced to swallow the first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13.

The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+, leaving nothing left over to satisfy the "foo" at the end of the expression. Use a possessive quantifier for situations where you want to seize all of something without ever backing off; it will outperform the equivalent greedy quantifier in cases where the match is not immediately found.

like image 969
xyz Avatar asked Jan 31 '17 11:01

xyz


1 Answers

General rules

A basic knowledge of the quantifiers ?, *, and + (respectively, "zero or one", "zero or more", "one or more") is understood.

  • We say that a quantifier is greedy if it tries to elaborate as many characters as possible.
  • We say that a quantifier is reluctant (lazy) if it tries to elaborate less characters as possible.
  • We say that a quantifier is possessive if it is greedy and it do not allow to backtrace.

You can understand what "backtrace" means only if you know how regex parsers work (see "Dynamic example" below).

Single-case explanations

  • ?: first test 1 occurrence, then 0; if you have found 1 occurrence and then you need to discard it, you can do it
  • ??: first test 0 occurrences, then 1
  • ?+: first test 1 occurrence, then 0; if you have found 1 occurrence and then you need to discard it, you can't do it
  • *: try to get as many occurrences as possible (even 0); if you have found N occurrences and then you need to discard (some of) them, you can do it, starting from the last
  • *?: try to get less occurrences as possible (even 0)
  • *+: try to get as many occurrences as possible (even 0); if you have found N occurrences and then you need to discard (some of) them, you can't do it
  • +: try to get as many occurrences as possible (at least 1); if you have found N occurrences and then you need to discard (some of) them, you can do it, starting from the last
  • +?: try to get less occurrences as possible (at least 1)
  • ++: try to get as many occurrences as possible (at least 1); if you have found N occurrences and then you need to discard (some of) them, you can't do it

Dynamic example

In this section I'll try to show you what is the logic behind a regex parser:

1) Case /.*foo/:

First it's the turn of subpattern /.*/. It starts elaborating the first character:

xfooxxxxxxfoo
^

Then it tries to elaborate as many characters as possible:

xfooxxxxxxfoo
^^
xfooxxxxxxfoo
^^^
[...]
xfooxxxxxxfoo
^^^^^^^^^^^
xfooxxxxxxfoo
^^^^^^^^^^^^
xfooxxxxxxfoo
^^^^^^^^^^^^^

The cursor reached the end, but the subpattern /foo/ hasn't played its role yet. So the cursor "goes back" for allowing the subpattern /foo/ to get a match:

xfooxxxxxxfoo
^^^^^^^^^^^^

/foo/ still cannot get a match, so we need to go back again:

xfooxxxxxxfoo
^^^^^^^^^^^
xfooxxxxxxfoo
^^^^^^^^^^

Now the subpattern /foo/ can get a match:

xfooxxxxxxfoo
^^^^^^^^^^^^^

So the match is the whole string xfooxxxxxxfoo.

2) Case /.*?foo/:

First it's the turn of subpattern /.*?/. It is lazy, so we'd like it maches 0 characters. But if it did, the subpattern /foo/ couldn't get a match, so it has to elaborate one character:

xfooxxxxxxfoo
^

Now it's foo's turn:

xfooxxxxxxfoo
^^^^

So the match is xfoo.

(if you set the type global for the regex, then the parser would restart from the first character after the match, giving a second match xxxxxxfoo)

3) Case /.*+foo/:

First it's the turn of subpattern /.*+/. It tries to elaborate as many characters as possible:

xfooxxxxxxfoo
^
[...]
xfooxxxxxxfoo
^^^^^^^^^^^^^

The cursor reached the end, but the subpattern /foo/ hasn't played its role yet. So the cursor "goes back"... oh no, what a pity, it can't (because it is possessive)!

So we have no matches.

like image 153
horcrux Avatar answered Oct 07 '22 16:10

horcrux