What does the .*? regular expression actually mean?

Tags:

regex

perl

I've been using Perl for a decade, but lately I've become confused by the .*? regex.

It does not seem to match the minimum number of characters, and sometimes it gives different results.

For example, for the string aaaaaaaaaaaaaaaaaaaaaaammmmmmmmmmmbaaaaaaaaaaaaaaaaaaaaaab and the pattern a.*?b, it matches the complete input string in two groups. As per the definition, it should have matched only the last "ab".

502

asked Mar 23 '11 06:03

AgA

2 Answers

The ? doesn't cause a.*?b to match the fewest characters possible; it causes .* to match the fewest characters possible. Since it only affects .*, it has no effect on what's already been matched (i.e. by a).

Example shortened to:

#01234
'aaab' =~ /a.*?b/

What happens:

1. At pos 0, a matches 1 character (a).
2. At pos 1, .*? matches 0 characters (empty string).
3. At pos 1, b fails to match. ⇒ backtrack
4. At pos 1, .*? matches 1 character (a).
5. At pos 2, b fails to match. ⇒ backtrack
6. At pos 1, .*? matches 2 characters (aa).
7. At pos 3, b matches 1 character (b).
8. Pattern match successful.

As you can see, it tried to match zero characters, which is obviously the smallest possible match. But the overall pattern failed to match when it did so, so larger and larger matches were tried until the overall pattern matched.
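A minimal sketch of the same behaviour, run against both the shortened example and the string from the question. The variable names are only for illustration. The match begins at the leftmost a because that is the first position where the overall pattern can succeed; .*? then grows only as far as the nearest b.

```perl
use strict;
use warnings;

# Shortened example: the match starts at the first "a", not the last one.
my ($short) = 'aaab' =~ /(a.*?b)/;
print "short: $short\n";    # prints "short: aaab"

# The string from the question, copied verbatim.
my $input = 'aaaaaaaaaaaaaaaaaaaaaaammmmmmmmmmmbaaaaaaaaaaaaaaaaaaaaaab';

# With /g in list context and no capture groups,
# each returned element is one complete match.
my @matches = $input =~ /a.*?b/g;

print scalar(@matches), " matches\n";    # prints "2 matches"
print "$_\n" for @matches;
```

The two matches together cover the whole input, which is the "two groups" described in the question: the first runs from the first a to the first b, and the second from the next a to the final b.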

I try to avoid the non-greedy modifier. Here, a negated character class matches only the final ab:

'aaab' =~ /a[^a]*b/

If a is really something more complex, then one can use a negative lookahead.

'aaab' =~ /a(?:(?!a).)*b/
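As a rough check, the sketch below runs all three patterns against the shortened example and captures what each one matches. The variable names are illustrative.

```perl
use strict;
use warnings;

my $str = 'aaab';

# Non-greedy: starts at the first "a" and grows until a "b" is found.
my ($lazy) = $str =~ /(a.*?b)/;

# Negated class: no "a" may appear between the leading "a" and the "b".
my ($negated) = $str =~ /(a[^a]*b)/;

# Negative lookahead: a character is consumed only if it is not an "a".
my ($tempered) = $str =~ /(a(?:(?!a).)*b)/;

print "lazy:     $lazy\n";        # aaab
print "negated:  $negated\n";     # ab
print "tempered: $tempered\n";    # ab
```

The last two patterns cannot start at an earlier a, because the characters that follow it are themselves a and are excluded from the middle of the match.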

111

answered Nov 11 '22 05:11

ikegami

It means

.   # match any character except newlines
*   # zero or more times
?   # matching as few characters as possible

So in

<tag> text </tag> more text <tag> even more text </tag>

the regex <tag>(.*)</tag> will match the entire string at once, capturing

 text </tag> more text <tag> even more text

in backreference number 1.

If you match that with <tag>(.*?)</tag> instead, you'll get two matches:

<tag> text </tag>
<tag> even more text </tag>

with only text and even more text being captured in backreference number 1, respectively.

And if (thanks Kobi!) your source text is

<tag> text <tag> nested text </tag> back to first level </tag>

then you'll find out that <tag>(.*)</tag> matches the whole string again, but <tag>(.*?)</tag> will match

<tag> text <tag> nested text </tag>

because the regex engine works from left to right. This is one of the reasons regular expressions are "not the best tool" for matching context-free grammars.
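The three cases above can be reproduced with a short Perl sketch. The variable names are illustrative, and m{...} is used as the delimiter only so that the slash in </tag> needs no escaping.

```perl
use strict;
use warnings;

my $flat = '<tag> text </tag> more text <tag> even more text </tag>';

# Greedy: a single match; group 1 runs up to the last </tag>.
my @greedy = $flat =~ m{<tag>(.*)</tag>}g;

# Non-greedy: two matches; each group 1 stops at the nearest </tag>.
my @lazy = $flat =~ m{<tag>(.*?)</tag>}g;

my $nested = '<tag> text <tag> nested text </tag> back to first level </tag>';

# Non-greedy on nested input: starts at the first <tag>
# and stops at the first </tag>, ignoring the nesting.
my ($nested_match) = $nested =~ m{(<tag>.*?</tag>)};

print "greedy: [$_]\n" for @greedy;
print "lazy:   [$_]\n" for @lazy;
print "nested: [$nested_match]\n";
```

Note that the captured text keeps the spaces around it, since .* and .*? match whitespace like any other character.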

answered Nov 11 '22 03:11

Tim Pietzcker
