
What does the .*? regular expression actually mean?

Tags: regex, perl

I've been using Perl for a decade, but lately I've become confused by the .*? regex.

It doesn't seem to match the minimum number of characters, and sometimes it gives different results.

For example, given the string aaaaaaaaaaaaaaaaaaaaaaammmmmmmmmmmbaaaaaaaaaaaaaaaaaaaaaab and the pattern a.*?b, it matches the complete input string in two groups. As per the definition, it should have matched only the last "ab".

AgA asked Mar 23 '11 06:03



2 Answers

The ? doesn't cause a.*?b as a whole to match the fewest characters possible; it causes .* to match the fewest characters possible. Since it only affects .*, it has no effect on what's already been matched (i.e. by a).

Example shortened to:

#01234
'aaab' =~ /a.*?b/

What happens:

  1. At pos 0, a matches 1 character (a).
  2. At pos 1, .*? matches 0 characters (empty string).
  3. At pos 1, b fails to match. ⇒ backtrack
  4. At pos 1, .*? matches 1 character (a).
  5. At pos 2, b fails to match. ⇒ backtrack
  6. At pos 1, .*? matches 2 characters (aa).
  7. At pos 3, b matches 1 character (b).
  8. Pattern match successful.

As you can see, it tried to match zero characters, which is obviously the smallest possible match. But the overall pattern failed to match when it did so, so larger and larger matches were tried until the overall pattern matched.
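The steps above can be sketched in code. This example uses Python's re module as a stand-in for Perl (an assumption of the sketch; for these simple patterns the two behave the same). lazy_trace is a hypothetical helper that models only the pattern a.*?b anchored at one start position; it is not a general regex engine.

```python
import re

def lazy_trace(text, start=0):
    """Model /a.*?b/ anchored at `start`.

    Returns (matched_text_or_None, list of lengths tried for .*?).
    """
    if not text.startswith("a", start):
        return None, []
    tries = []
    pos = start + 1          # position right after the literal 'a'
    n = 0                    # number of characters given to .*?
    while True:
        end = pos + n
        tries.append(n)
        if text.startswith("b", end):
            return text[start:end + 1], tries
        # 'b' failed: backtrack and let .*? take one more character,
        # unless there is nothing left ('.' also refuses a newline).
        if end >= len(text) or text[end] == "\n":
            return None, tries
        n += 1

match, tries = lazy_trace("aaab")
# match == 'aaab', tries == [0, 1, 2]: zero characters was tried first

engine = re.search(r"a.*?b", "aaab")
# engine.group() == 'aaab', engine.span() == (0, 4)

# A shortened version of the string from the question: two matches
# that together cover the whole input.
pieces = re.findall(r"a.*?b", "aaammbaaab")
# pieces == ['aaammb', 'aaab']
```

The match always begins at the leftmost a because the engine commits to the earliest start position that can succeed; laziness only decides how far the match extends from there.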


I try to avoid the non-greedy modifier. A negated character class states directly what may appear between a and b:

'aaab' =~ /a[^a]*b/

If a is really something more complex, then one can use a negative lookahead.

'aaab' =~ /a(?:(?!a).)*b/
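
A minimal comparison of the three patterns, again using Python's re module as a stand-in for Perl (an assumption of this sketch). The START/END delimiters in the last part are hypothetical, chosen only to illustrate the case where "a" is something longer than one character.

```python
import re

short = "aaab"
lazy = re.search(r"a.*?b", short)              # starts at the first 'a'
neg_class = re.search(r"a[^a]*b", short)       # no 'a' allowed in between
tempered = re.search(r"a(?:(?!a).)*b", short)  # same idea via lookahead
# lazy.group() == 'aaab'
# neg_class.group() == 'ab' and tempered.group() == 'ab', both at span (2, 4)

longer = "aaammbaaab"
shortest = re.findall(r"a[^a]*b", longer)
# shortest == ['ammb', 'ab']

# Multi-character opening delimiter (hypothetical START/END markers):
text = "START one START two END"
lazy_long = re.search(r"START.*?END", text).group()
tempered_long = re.search(r"START(?:(?!START).)*END", text).group()
# lazy_long == 'START one START two END'
# tempered_long == 'START two END'
```

Both alternatives forbid a second opening delimiter inside the match, which is what forces the match to start at the last a before the b.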
ikegami answered Nov 11 '22 05:11


It means

.   # match any character except newlines
*   # zero or more times
?   # matching as few characters as possible

So in

<tag> text </tag> more text <tag> even more text </tag>

the regex <tag>(.*)</tag> will match the entire string at once, capturing

 text </tag> more text <tag> even more text 

in backreference number 1.

If you match that with <tag>(.*?)</tag> instead, you'll get two matches:

  1. <tag> text </tag>
  2. <tag> even more text </tag>

with only text and even more text being captured in backreference number 1, respectively.
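
The difference can be checked directly. This sketch uses Python's re module (an assumption; the patterns are the same as in Perl), with findall returning the contents of capture group 1 for each match.

```python
import re

text = "<tag> text </tag> more text <tag> even more text </tag>"

greedy = re.findall(r"<tag>(.*)</tag>", text)
# greedy == [' text </tag> more text <tag> even more text ']

lazy = re.findall(r"<tag>(.*?)</tag>", text)
# lazy == [' text ', ' even more text ']
```

Note that the captures keep the surrounding spaces, since .* and .*? match them like any other character.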

And if (thanks Kobi!) your source text is

<tag> text <tag> nested text </tag> back to first level </tag>

then you'll find out that <tag>(.*)</tag> matches the whole string again, but <tag>(.*?)</tag> will match

<tag> text <tag> nested text </tag>

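because the regex engine works from left to right: it starts at the first <tag>, and the lazy quantifier stops at the first </tag> it reaches, with no notion of nesting. This is one of the reasons regular expressions are "not the best tool" for matching context-free grammars.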
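
A short check of the nested case, sketched with Python's re module (an assumption; the patterns are unchanged from the Perl discussion):

```python
import re

nested = "<tag> text <tag> nested text </tag> back to first level </tag>"

greedy = re.search(r"<tag>(.*)</tag>", nested).group(0)
# greedy == nested (the whole string)

lazy = re.search(r"<tag>(.*?)</tag>", nested).group(0)
# lazy == '<tag> text <tag> nested text </tag>'
```

Neither result pairs each opening tag with its own closing tag; doing that requires tracking nesting depth, which plain quantifiers cannot express.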

Tim Pietzcker answered Nov 11 '22 03:11