
Why is Perl lazy when regex matching with * against a group?

Tags:

regex

perl

In Perl, * is usually greedy unless you add a ? after it. When * is applied to a group, however, the behavior seems different. My question is why. Consider this example:

my $text = 'f fjfj ff';
my (@matches) = $text =~ m/((?:fj)*)/;
print "@matches\n";
# --> ""
@matches = $text =~ m/((?:fj)+)/;
print "@matches\n";
# --> "fjfj"

In the first match, Perl appears to lazily match nothing, even though it could have matched something, as the second match demonstrates. Oddly, * is greedy as expected when the group contains just . instead of literal characters:

@matches = $text =~ m/((?:..)*)/;
print "@matches\n";
# --> 'f fjfj f'
  1. Note: The above was tested on Perl 5.12.
  2. Note: It doesn't matter whether I use capturing or non-capturing parentheses for the inner group.
Joshua Richardson asked Jul 09 '13 00:07



2 Answers

This isn't a matter of greedy or lazy repetition. (?:fj)* is greedily matching as many repetitions of "fj" as it can, but it will successfully match zero repetitions. When you try to match it against the string "f fjfj ff", it will first attempt to match at position zero (before the first "f"). The maximum number of times you can successfully match "fj" at position zero is zero, so the pattern successfully matches the empty string. Since the pattern successfully matched at position zero, we're done, and the engine has no reason to try a match at a later position.

The moral of the story is: don't write a pattern that can match nothing, unless you want it to match nothing.
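
To see where each pattern actually matches, here is a minimal sketch that reports the start and end offsets of the first match using Perl's built-in @- and @+ arrays. The helper name match_span is hypothetical and only for illustration; the string and the two patterns come from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (start, end, text) of the first match,
# or an empty list if the pattern does not match at all.
sub match_span {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    # $-[0] and $+[0] hold the start and end offsets of the whole match.
    return ($-[0], $+[0], substr($text, $-[0], $+[0] - $-[0]));
}

my $text = 'f fjfj ff';

# (?:fj)* may match zero repetitions, so the attempt at offset 0 succeeds.
printf "star: start=%d end=%d text='%s'\n", match_span($text, qr/(?:fj)*/);
# star: start=0 end=0 text=''

# (?:fj)+ needs at least one "fj", so offsets 0 and 1 fail and offset 2 wins.
printf "plus: start=%d end=%d text='%s'\n", match_span($text, qr/(?:fj)+/);
# plus: start=2 end=6 text='fjfj'
```

Both quantifiers are greedy here. The only difference is that * is allowed to succeed with an empty match before the engine ever reaches the first "fj".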

hobbs answered Oct 29 '22 20:10


Perl will match as early as possible in the string (left-most). It can do that with your first match by matching zero occurrences of fj at the start of the string.
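
As a rough illustration that left-most takes priority over longest, the sketch below uses a sample string of my own ('fj fjfj', not from the question) and a hypothetical helper fj_runs. If you do want to keep * in the pattern, one possible approach is to match with /g in list context and discard the empty matches.

```perl
use strict;
use warnings;

# Hypothetical helper: all non-empty runs of "fj", in order of appearance.
# With /g in list context, m// returns the capture from every match,
# including the empty ones, which grep then filters out.
sub fj_runs {
    my ($text) = @_;
    return grep { length } $text =~ m/((?:fj)*)/g;
}

my $sample = 'fj fjfj';    # assumed sample string, for illustration only

# The earliest match wins, even though a longer one exists further right.
my ($first) = $sample =~ m/((?:fj)+)/;
print "first: $first\n";   # first: fj

my @runs = fj_runs($sample);
print "runs: @runs\n";     # runs: fj fjfj
```

Applied to the string from the question, fj_runs('f fjfj ff') yields the single run 'fjfj'.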

Adrian Pronk answered Oct 29 '22 21:10