Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression: who's greedier?

Tags:

java

regex

My primary concern is with the Java flavor, but I'd also appreciate information regarding others.

Let's say you have a subpattern like this:

(.*)(.*)

Not very useful as is, but let's say these two capture groups (say, \1 and \2) are part of a bigger pattern that matches with backreferences to these groups, etc.

So both are greedy, in that they try to capture as much as possible, only taking less when they have to.

My question is: who's greedier? Does \1 get first priority, giving \2 its share only if it has to?

What about:

(.*)(.*)(.*)

Let's assume that \1 does get first priority. Let's say it got too greedy, and then spit out a character. Who gets it first? Is it always \2 or can it be \3?

Let's assume it's \2 that gets \1's rejection. If this still doesn't work, who spits out now? Does \2 spit to \3, or does \1 spit out another to \2 first?


Bonus question

What happens if you write something like this:

(.*)(.*?)(.*)

Now \2 is reluctant. Does that mean \1 spits out to \3, and \2 only reluctantly accepts \3's rejection?


Example

Maybe it was a mistake for me not to give concrete examples to show how I'm using these patterns, but here's some:

System.out.println(
    "OhMyGod=MyMyMyOhGodOhGodOhGod"
    .replaceAll("^(.*)(.*)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><My><God>"

// same pattern, different input string
System.out.println(
    "OhMyGod=OhMyGodOhOhOh"
    .replaceAll("^(.*)(.*)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><MyGod><>"

// now \2 is reluctant
System.out.println(
    "OhMyGod=OhMyGodOhOhOh"
    .replaceAll("^(.*)(.*?)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><><MyGod>"
like image 838
polygenelubricants Avatar asked Apr 02 '10 09:04

polygenelubricants


2 Answers

\1 will have priority, \2 and \3 will always match nothing. \2 will then have priority over \3.

As a general rule think of it like this, back-tracking will only occur to satisfy a match, it will not occur to satisfy greediness, so left is best :)

explaining back tracking and greediness is to much for me to tackle here, i'd suggest friedl's Mastering Regular Expressions

like image 119
Paul Creasey Avatar answered Sep 21 '22 14:09

Paul Creasey


The addition of your concrete examples changes the nature of the question drastically. It still starts out as I described in my first answer, with the first (.*) gobbling up all the characters, and the second and third groups letting it have them, but then it has to match an equals sign.

Obviously there isn't one at the end of the string, so group #1 gives back characters one by one until the = in the regex can match the = in the target. Then the regex engine starts trying to match (\1|\2|\3)+$ and the real fun starts.

Group 1 gives up the d and group 2 (which is still empty) takes it, but the rest of the regex still can't match. Group 1 gives up the o and group 2 matches od, but the rest of the regex still can't match. And so it goes, with the third group getting involved, and the three of them slicing up the input in every way possible until an overall match is achieved. RegexBuddy reports that it takes 13,426 steps to get there.

In the first example, greediness (or lack of it) isn't really a factor; the only way a match can be achieved is if the words Oh, My and God are captured in separate groups, so eventually that's what happens. It doesn't even matter which group captures which word--that's just first come, first served, as I said before.

In the second and third examples it's only necessary to break the prefix into two chunks: Oh and MyGod. Group 2 captures MyGod in the second example because it's next in line and it's greedy, just like in the first example. In the third example, every time group 1 drops a character, group 2 (being reluctant) lets group 3 take it instead, so that's the one that ends up in possession of MyGod.

It's more complicated (and tedious) than that, of course, but I hope this answers your question. And I have to say, that's an interesting target string you chose; if it were possible for a regex engine to have an orgasm, I think these regexes would be the ones to bring it off. :D

like image 38
Alan Moore Avatar answered Sep 19 '22 14:09

Alan Moore