I've been using Perl for a decade, but lately I've been confused by the .*? regex.
It does not seem to match the minimum number of characters, and sometimes it gives different results.
For example, for the string aaaaaaaaaaaaaaaaaaaaaaammmmmmmmmmmbaaaaaaaaaaaaaaaaaaaaaab and the pattern a.*?b, it matches the complete input string in two groups. As per the definition, it should have matched only the last "ab".
(.*?) matches any character ( . ) any number of times ( * ), as few times as possible to make the regex match ( ? ). Used on its own, it matches at the start of any string but captures only an empty string, because with nothing after it in the pattern the lazy quantifier can stop at zero characters.
. represents any single character (usually excluding the newline character), while * is a quantifier meaning zero or more of the preceding regex atom (character or group). ? is a quantifier meaning zero or one instances of the preceding atom, or (in regex variants that support it) a modifier that sets the quantifier ...
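A minimal sketch of the "captures a blank string" behaviour, written in Python rather than Perl on the assumption that its re module treats these simple patterns the same way. The sample string "hello" and the variable names are illustrative only.

```python
import re

# A bare lazy group has nothing after it that must match,
# so it is satisfied with zero characters.
bare = re.search(r"(.*?)", "hello")
print(repr(bare.group(1)), bare.span())   # '' (0, 0)

# Anchoring the end forces the lazy group to grow until $ can match.
anchored = re.search(r"(.*?)$", "hello")
print(repr(anchored.group(1)))            # 'hello'
```

The second pattern shows that "as few as possible" always means "as few as possible while still letting the rest of the pattern match".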
It doesn't cause a.*?b to match the fewest characters possible; it causes .* to match the fewest characters possible. Since it only affects .*, it has no effect on what's already been matched (i.e. by a).
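To see this on the string from the question, here is a small sketch in Python (assumed here to behave like Perl for this pattern). It only inspects where each match starts and ends, so it does not depend on the exact number of characters.

```python
import re

text = "aaaaaaaaaaaaaaaaaaaaaaammmmmmmmmmmbaaaaaaaaaaaaaaaaaaaaaab"

# Collect the (start, end) span of every a.*?b match, left to right.
matches = [m.span() for m in re.finditer(r"a.*?b", text)]
first_b = text.index("b")

# The first match starts at the very first a (position 0), not at the
# a closest to b; laziness only decides how far .*? extends to the right.
print(matches[0] == (0, first_b + 1))           # True
# The second match picks up right after the first b and runs to the end.
print(matches[1] == (first_b + 1, len(text)))   # True
```

Together the two matches cover the whole input, which is the "two groups" result described in the question.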
Example shortened to:
#01234
'aaab' =~ /a.*?b/
What happens:
a matches 1 character (a).
.*? matches 0 characters (empty string).
b fails to match. ⇒ backtrack
.*? matches 1 character (a).
b fails to match. ⇒ backtrack
.*? matches 2 characters (aa).
b matches 1 character (b).
As you can see, it tried to match zero characters, which is obviously the smallest possible match. But the overall pattern failed to match when it did so, so larger and larger matches were tried until the overall pattern matched.
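The steps above can be replayed with a hand-written loop. This is only an illustration of the trace, not how a real engine is implemented: lazy_trace is a hypothetical helper, it handles only the fixed pattern a.*?b tried at position 0, and it ignores the rule that . does not match a newline. Python is used on the assumption that it matches Perl here.

```python
import re

def lazy_trace(text):
    """Replay a.*?b at position 0: give .*? one more character per attempt."""
    steps = []
    if not text.startswith("a"):
        return steps
    for n in range(len(text)):                # n = characters given to .*?
        b_matches = text[1 + n:2 + n] == "b"  # can b match right after?
        steps.append((n, text[1:1 + n], b_matches))
        if b_matches:
            break
    return steps

trace = lazy_trace("aaab")
for n, taken, ok in trace:
    outcome = "b matches" if ok else "b fails, backtrack"
    print(f".*? takes {n} character(s) {taken!r}: {outcome}")

# The real engine reaches the same result.
match = re.search(r"a.*?b", "aaab")
print(match.group(0), match.span())           # aaab (0, 4)
```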
I try to avoid the non-greedy modifier. Excluding a from what the middle part may match forces the match to start at the last a before b:
'aaab' =~ /a[^a]*b/
If a is really something more complex, then one can use a negative lookahead.
'aaab' =~ /a(?:(?!a).)*b/
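A quick comparison of the three patterns on the shortened example, sketched in Python on the assumption that it handles these constructs as Perl does. The variable names are illustrative.

```python
import re

text = "aaab"

lazy = re.search(r"a.*?b", text)               # non-greedy version
negated = re.search(r"a[^a]*b", text)          # negated character class
lookahead = re.search(r"a(?:(?!a).)*b", text)  # negative lookahead

print(lazy.group(0), lazy.span())              # aaab (0, 4)
print(negated.group(0), negated.span())        # ab (2, 4)
print(lookahead.group(0), lookahead.span())    # ab (2, 4)
```

Both alternatives refuse to step over another a, so the attempts starting at positions 0 and 1 fail and the match begins at the last a.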
The construct .*? means
. # match any character except newlines
* # zero or more times
? # matching as few characters as possible
So in
<tag> text </tag> more text <tag> even more text </tag>
the regex <tag>(.*)</tag> will match the entire string at once, capturing
 text </tag> more text <tag> even more text
in backreference number 1.
If you match that with <tag>(.*?)</tag> instead, you'll get two matches:
<tag> text </tag>
<tag> even more text </tag>
with only text and even more text being captured in backreference number 1, respectively.
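The greedy and lazy results can be checked side by side. This is a sketch in Python (assumed equivalent to Perl for these patterns); re.findall returns the contents of capture group 1 for each match, and the captures keep the spaces around the text.

```python
import re

text = "<tag> text </tag> more text <tag> even more text </tag>"

greedy = re.findall(r"<tag>(.*)</tag>", text)
lazy = re.findall(r"<tag>(.*?)</tag>", text)

print(greedy)  # [' text </tag> more text <tag> even more text ']
print(lazy)    # [' text ', ' even more text ']
```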
And if (thanks Kobi!) your source text is
<tag> text <tag> nested text </tag> back to first level </tag>
then you'll find out that <tag>(.*)</tag> matches the whole string again, but <tag>(.*?)</tag> will match
<tag> text <tag> nested text </tag>
because the regex engine works from left to right. This is one of the reasons regular expressions are "not the best tool" for matching context-free grammars.
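The nested case can be reproduced the same way, again sketched in Python under the assumption that it behaves like Perl for these patterns.

```python
import re

text = "<tag> text <tag> nested text </tag> back to first level </tag>"

greedy = re.search(r"<tag>(.*)</tag>", text)
lazy = re.search(r"<tag>(.*?)</tag>", text)

print(greedy.group(0) == text)  # True
print(lazy.group(0))            # <tag> text <tag> nested text </tag>
```

The lazy match pairs the outer opening tag with the inner closing tag, because the engine starts at the leftmost <tag> and stops at the first </tag> it can reach.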