Why does the Perl regex '*?' stay greedy?

Tags: regex, perl

I run a simple program:

$_ = '/login/.htaccess/.htdf';   # plain $_; lexical "my $_" was removed in Perl 5.24
s!(/\.ht.*?)$!/!;
print "$_ $1";

OUT
/login/ /.htaccess/.htdf

I want this regex to match only /.htdf.

Example 2:

$_ = 'abcbc';   # plain $_; lexical "my $_" was removed in Perl 5.24
m/(b.*?)$/;
print "$_ $1\n";

OUT
abcbc bcbc

I expect bc.

Why is *? still greedy? (I want the minimal match.)

Eugen Konkov asked Aug 27 '15 15:08


1 Answer

Atoms are matched in sequence, and each atom after the first must match at the position where the previous atom left off. (The first atom is implicitly preceded by \A(?s:.)*?.) That means that a quantified atom such as .* or .*? doesn't get to decide where it starts matching; it only gets to decide where it stops matching.
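A small sketch of that rule, reusing the string from Example 2. The variable names are only illustrative. It compares the lazy quantifier with and without the trailing $ and reports where the capture starts, using the built-in @- offsets array.

```perl
use strict;
use warnings;

my $str = 'abcbc';

# Lazy quantifier followed by an end anchor: the match still starts at the
# leftmost "b" (offset 1), so .*? has to stretch to the end of the string.
my ($lazy_start, $lazy_text);
if ($str =~ /(b.*?)$/) {
    ($lazy_start, $lazy_text) = ($-[1], $1);
}

# Without the anchor, the same lazy quantifier is free to match nothing.
my ($free_text) = $str =~ /(b.*?)/;

print "anchored:   offset $lazy_start, text '$lazy_text'\n";
print "unanchored: text '$free_text'\n";
```

In both cases the starting position is the same; laziness only changes how far the match extends from there.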

Example 1

It's not being greedy. \.ht brings the match to position 10, and at position 10, the minimum .*? can match and still have the rest of the pattern match is access/.htdf. In fact, it's the only thing .*? can match at position 10 and still have the rest of the pattern match.

I think you want to remove that last part of the path if it starts with .ht, leaving the preceding / in place. For that, you can use either of the following:

s{/\.ht[^/]*$}{/}

or

s{/\K\.ht[^/]*$}{}
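To check that the two substitutions behave the same way, here is a minimal sketch that applies each one to a copy of the path from the question. The variable names are made up for the example; \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my $path = '/login/.htaccess/.htdf';

# Variant 1: match the slash too, then put it back in the replacement.
(my $with_slash = $path) =~ s{/\.ht[^/]*$}{/};

# Variant 2: \K keeps the already-matched slash out of the replaced text.
(my $with_keep = $path) =~ s{/\K\.ht[^/]*$}{};

print "$with_slash\n";
print "$with_keep\n";
```

Because [^/]* cannot cross a slash, the first /.ht in the path can no longer reach the end of the string, so only the final component is removed.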

Example 2

It's not being greedy. b brings the match to position 2, and at position 2, the minimum .*? can match and still have the rest of the pattern match is cbc. In fact, it's the only thing .*? can match at position 2 and still have the rest of the pattern match.

You are probably looking for

/b[^b]*$/

or

/b(?:(?!b).)*$/    # You'd use this if "b" was really more than one char.
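
A quick sketch of both patterns in use. Capture groups are added here only so the matched text can be printed, and the second input string is a hypothetical sample chosen to show a two-character delimiter ("bc").

```perl
use strict;
use warnings;

my $str = 'abcbc';

# Single-character delimiter: exclude "b" from what follows the final "b".
my ($last_b) = $str =~ /(b[^b]*)$/;

# Multi-character delimiter: (?:(?!bc).)* consumes one character at a time,
# but only at positions where "bc" does not start.
my $long = 'xxbcyybczz';    # hypothetical sample input
my ($last_bc) = $long =~ /(bc(?:(?!bc).)*)$/;

print "$last_b\n";
print "$last_bc\n";
```

Both patterns work by forbidding the delimiter inside the tail, which forces the engine to fail at earlier starting positions and move on to the last one.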
ikegami answered Sep 30 '22 18:09