* means zero-or-more, and + means one-or-more. So the difference is that the empty string would match the second expression but not the first.
(. *?) matches any character ( . ) any number of times ( * ), as few times as possible to make the regex match ( ? ). You'll get a match on any string, but you'll only capture a blank string because of the question mark.
The Match-zero-or-more Operator ( * ) This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern.
So, a+b means [ab] or a|b , thus (a+b)* means any string of length 0 or more, containing any number of a s and b s in any order. Likewise, (a*b*)* also means any string of length 0 or more, containing any number of a s and b s in any order.
Repetition in regex by default is greedy: they try to match as many reps as possible, and when this doesn't work and they have to backtrack, they try to match one fewer rep at a time, until a match of the whole pattern is found. As a result, when a match finally happens, a greedy repetition would match as many reps as possible.
The ?
as a repetition quantifier changes this behavior into non-greedy, also called reluctant (in e.g. Java) (and sometimes "lazy"). In contrast, this repetition will first try to match as few reps as possible, and when this doesn't work and they have to backtrack, they start matching one more rept a time. As a result, when a match finally happens, a reluctant repetition would match as few reps as possible.
Let's compare these two patterns: A.*Z
and A.*?Z
.
Given the following input:
eeeAiiZuuuuAoooZeeee
The patterns yield the following matches:
A.*Z
yields 1 match: AiiZuuuuAoooZ
(see on rubular.com)A.*?Z
yields 2 matches: AiiZ
and AoooZ
(see on rubular.com)Let's first focus on what A.*Z
does. When it matched the first A
, the .*
, being greedy, first tries to match as many .
as possible.
eeeAiiZuuuuAoooZeeee
\_______________/
A.* matched, Z can't match
Since the Z
doesn't match, the engine backtracks, and .*
must then match one fewer .
:
eeeAiiZuuuuAoooZeeee
\______________/
A.* matched, Z still can't match
This happens a few more times, until finally we come to this:
eeeAiiZuuuuAoooZeeee
\__________/
A.* matched, Z can now match
Now Z
can match, so the overall pattern matches:
eeeAiiZuuuuAoooZeeee
\___________/
A.*Z matched
By contrast, the reluctant repetition in A.*?Z
first matches as few .
as possible, and then taking more .
as necessary. This explains why it finds two matches in the input.
Here's a visual representation of what the two patterns matched:
eeeAiiZuuuuAoooZeeee
\__/r \___/r r = reluctant
\____g____/ g = greedy
In many applications, the two matches in the above input is what is desired, thus a reluctant .*?
is used instead of the greedy .*
to prevent overmatching. For this particular pattern, however, there is a better alternative, using negated character class.
The pattern A[^Z]*Z
also finds the same two matches as the A.*?Z
pattern for the above input (as seen on ideone.com). [^Z]
is what is called a negated character class: it matches anything but Z
.
The main difference between the two patterns is in performance: being more strict, the negated character class can only match one way for a given input. It doesn't matter if you use greedy or reluctant modifier for this pattern. In fact, in some flavors, you can do even better and use what is called possessive quantifier, which doesn't backtrack at all.
This example should be illustrative: it shows how the greedy, reluctant, and negated character class patterns match differently given the same input.
eeAiiZooAuuZZeeeZZfff
These are the matches for the above input:
A[^Z]*ZZ
yields 1 match: AuuZZ
(as seen on ideone.com)A.*?ZZ
yields 1 match: AiiZooAuuZZ
(as seen on ideone.com)A.*ZZ
yields 1 match: AiiZooAuuZZeeeZZ
(as seen on ideone.com)Here's a visual representation of what they matched:
___n
/ \ n = negated character class
eeAiiZooAuuZZeeeZZfff r = reluctant
\_________/r / g = greedy
\____________/g
These are links to questions and answers on stackoverflow that cover some topics that may be of interest.
It is the difference between greedy and non-greedy quantifiers.
Consider the input 101000000000100
.
Using 1.*1
, *
is greedy - it will match all the way to the end, and then backtrack until it can match 1
, leaving you with 1010000000001
..*?
is non-greedy. *
will match nothing, but then will try to match extra characters until it matches 1
, eventually matching 101
.
All quantifiers have a non-greedy mode: .*?
, .+?
, .{2,6}?
, and even .??
.
In your case, a similar pattern could be <([^>]*)>
- matching anything but a greater-than sign (strictly speaking, it matches zero or more characters other than >
in-between <
and >
).
See Quantifier Cheat Sheet.
Let's say you have:
<a></a>
<(.*)>
would match a></a
where as <(.*?)>
would match a
.
The latter stops after the first match of >
. It checks for one
or 0 matches of .*
followed by the next expression.
The first expression <(.*)>
doesn't stop when matching the first >
. It will continue until the last match of >
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With