
Why is the star quantifier greedier than the plus quantifier in Java regular expressions?

I'm trying to extract LogicalID and SupplyChain from the following text:

 <LogicalID>SupplyChain</Logical>

At first I used the following regex:

.*([A-Za-z]+)>([A-Za-z]+)<.*

This matched as follows:

["D", "SupplyChain"]

In a fit of desperation, I tried using the asterisk instead of the plus:

.*([A-Za-z]*)>([A-Za-z]+)<.*

This matched perfectly.

The documentation says * matches zero or more times and + matches one or more times. Why is * greedier than +?

EDIT: It's been pointed out to me below that this isn't the case. The order of operations explains why the first match group is actually empty.

asked Dec 09 '13 17:12 by duber




1 Answer

It's not a difference in greediness. In your first regex:

.*([A-Za-z]+)>([A-Za-z]+)<.*

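You are asking for any number of characters (.*), then at least one letter, then a >. The leading .* is greedy: it consumes as much as it can and backtracks only as far as needed for the rest of the pattern to match. Because [A-Za-z]+ needs just one letter, .* gives back only the D and keeps everything before it, so the first group has to be D.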
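To check this, here is a minimal sketch that runs the first regex against the sample text from the question. The class and method names (GreedyPlusDemo, groups) are made up for illustration, and it assumes the whole string is matched with Matcher.matches().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyPlusDemo {
    // Hypothetical helper: returns both capture groups of the first regex,
    // or null if the input does not match as a whole.
    static String[] groups(String input) {
        Matcher m = Pattern.compile(".*([A-Za-z]+)>([A-Za-z]+)<.*").matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = groups("<LogicalID>SupplyChain</Logical>");
        // The leading .* keeps "<LogicalI" and leaves only "D" for group 1.
        System.out.println("[" + g[0] + ", " + g[1] + "]"); // prints [D, SupplyChain]
    }
}
```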

In the second one, instead:

.*([A-Za-z]*)>([A-Za-z]+)<.*

You want any number of characters, followed by any number of letters (including none), then the >. Since [A-Za-z]* is satisfied by zero letters, the greedy .* does not have to give anything back: it consumes everything up to the first >, and the first capture group matches an empty string. I don't think that it "matches perfectly" at all.
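The empty first group is easy to confirm. The sketch below runs the second regex and, for comparison, one possible variant that makes the leading quantifier reluctant (.*?) so that the letters are left for the first group. The variant is only a suggestion and not part of the original question. The names StarGroupDemo and groups are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarGroupDemo {
    // Hypothetical helper: returns both capture groups for the given regex,
    // or null if the input does not match as a whole.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String text = "<LogicalID>SupplyChain</Logical>";

        // Second regex from the question: group 1 is the empty string.
        String[] star = groups(".*([A-Za-z]*)>([A-Za-z]+)<.*", text);
        System.out.println("[\"" + star[0] + "\", \"" + star[1] + "\"]");

        // Assumed alternative: a reluctant .*? takes as little as possible,
        // so group 1 can capture the whole tag name.
        String[] reluctant = groups(".*?([A-Za-z]+)>([A-Za-z]+)<.*", text);
        System.out.println("[\"" + reluctant[0] + "\", \"" + reluctant[1] + "\"]");
    }
}
```

With the sample text, the first call yields an empty first group and SupplyChain, while the reluctant variant yields LogicalID and SupplyChain.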

answered Oct 24 '22 00:10 by Aioros