
Strange issue with regex matching in perl, alternate attempts match

Tags: regex, perl

Consider the following perl script:

 #!/usr/bin/perl

 my $str = 'not-found=1,total-found=63,ignored=2';

 print "1. matched using regex\n" if ($str =~ m/total-found=(\d+)/g);
 print "2. matched using regex\n" if ($str =~ m/total-found=(\d+)/g);
 print "3. matched using regex\n" if ($str =~ m/total-found=(\d+)/g);
 print "4. matched using regex\n" if ($str =~ m/total-found=(\d+)/g);

 print "Bye!\n";

The output after running this is:

1. matched using regex
3. matched using regex
Bye!

The same regex matches once and does not match immediately after. Any idea why the alternate attempts to match the same string with the same regex fail in perl?

Thanks!

amitsaurav asked Apr 04 '13 17:04


1 Answer

Here is the long explanation of why your code doesn't work.

The /g modifier changes the behaviour of the regex to “global matching”. This will match all occurrences of the pattern in the string. However, how this matching is done depends on context. The two (main) contexts in Perl are list context (the plural) and scalar context (the singular).

In list context, a global regex match returns a list of all matched substrings, or a flat list of all matched captures:

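use feature 'say';  # say is not enabled by default
$_ = "foobaa";      # plain assignment: lexical "my $_" was removed in Perl 5.24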
my $regex = qr/[aeiou]/;

my @matches = /$regex/g; # match all vowels
say "@matches"; # "o o a a"
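When the pattern contains capture groups, the list returned in list context holds the captures rather than the whole matches. Here is a minimal sketch using the string from the question; the helper name pairs_in and the key pattern [\w-]+ are my own assumptions, picked only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: collect every key=value pair found in a string.
sub pairs_in {
    my ($str) = @_;
    # In list context, m//g with two capture groups returns one flat
    # list: (key1, value1, key2, value2, ...).
    my @flat = $str =~ /([\w-]+)=(\d+)/g;
    return @flat;
}

my @flat = pairs_in('not-found=1,total-found=63,ignored=2');
print "@flat\n";

# A flat key/value list converts directly into a hash.
my %count = @flat;
print "total-found is $count{'total-found'}\n";
```

This should print "not-found 1 total-found 63 ignored 2" followed by "total-found is 63". Because the match runs in list context, there is no iterator state to worry about afterwards.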

In scalar context, the match seems to return a Perl boolean describing whether the regex matched:

my $match = /$regex/g;
say $match; # "1" (on failure: the empty string)

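However, the regex has turned into an iterator. Each time the match is executed, it starts at the current position in the string (initially the start) and tries to match from there. If it matches, it returns true and the current position moves to just past the match. If the match fails, then

  • the match returns false, and
  • the current position in the string is reset to the start.

Because the position was reset, the next match will succeed again. This is exactly what happens in your script: match 1 succeeds and leaves the position just after "total-found=63", match 2 finds nothing in the rest of the string and resets the position, match 3 succeeds again, and match 4 fails.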

my $match;
say $match while $match = /$regex/g;
say "The match returned false, or the while loop would have gone on forever";
say "But we can match again" if /$regex/g;
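To see the same thing on the string from the question, the following sketch records the result and pos() after each attempt, and then repeats the check without /g. The helper names trace_matches and count_without_g are made up for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: run the same scalar-context /g match $n times
# against one string and record [attempt, matched?, pos] for each try.
sub trace_matches {
    my ($str, $n) = @_;
    my @log;
    for my $i (1 .. $n) {
        my $ok  = ($str =~ /total-found=(\d+)/g) ? 1 : 0;
        my $pos = pos($str);          # undef after a failed match
        push @log, [$i, $ok, $pos];
    }
    return @log;
}

# Hypothetical helper: the same check without /g has no iterator state,
# so every attempt starts at the beginning of the string.
sub count_without_g {
    my ($str, $n) = @_;
    my $hits = 0;
    for (1 .. $n) {
        $hits++ if $str =~ /total-found=(\d+)/;
    }
    return $hits;
}

my $str = 'not-found=1,total-found=63,ignored=2';
for my $row (trace_matches($str, 4)) {
    my ($i, $ok, $pos) = @$row;
    printf "%d. %s (pos: %s)\n", $i, $ok ? 'matched' : 'failed', $pos // 'undef';
}
print "without /g: ", count_without_g($str, 4), " of 4 matched\n";
```

Attempts 1 and 3 should report a match with pos 26 (just past "total-found=63"), attempts 2 and 4 should fail with pos undef, and the version without /g should match all 4 times. So if you only want to test whether the pattern occurs, simply dropping the /g is the easiest fix.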

The second effect, resetting the position, can be suppressed with the additional /c flag: with m//gc, a failed match leaves the current position where it was.
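A small sketch of the difference, assuming a throwaway test string 'aXb' and a made-up helper pos_after_failure:

```perl
use strict;
use warnings;

# Hypothetical helper: match /X/ twice in scalar context and return pos()
# after the second, failing attempt. A true $keep selects m//gc over m//g.
sub pos_after_failure {
    my ($keep) = @_;
    my $str = 'aXb';
    my $first  = ($str =~ /X/g);    # succeeds; pos($str) is now 2
    my $second = $keep
        ? ($str =~ /X/gc)           # fails, position is kept
        : ($str =~ /X/g);           # fails, position is reset
    return pos($str);
}

my $without_c = pos_after_failure(0);
my $with_c    = pos_after_failure(1);
printf "m//g:  pos after failure is %s\n", $without_c // 'undef';
printf "m//gc: pos after failure is %s\n", $with_c    // 'undef';
```

The first line should report undef and the second should report 2.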

The position in a string can be accessed with the pos function: pos($string) returns the current position, which can be set like pos($string) = 0.
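For illustration, here is a sketch that sets pos before matching and reads it after every match. The helper digits_from and its sample input are assumptions of mine, not anything from your script:

```perl
use strict;
use warnings;

# Hypothetical helper: find runs of digits starting at offset $start and
# return [match, offset just past the match] pairs.
sub digits_from {
    my ($str, $start) = @_;
    pos($str) = $start;               # the next m//g begins searching here
    my @found;
    while ($str =~ /(\d+)/g) {
        push @found, [$1, pos($str)]; # read the position after each match
    }
    return @found;
}

for my $hit (digits_from('a1b22c333', 2)) {
    print "$hit->[0] ends at offset $hit->[1]\n";
}
```

Starting at offset 2 skips the "1", so this should print "22 ends at offset 5" and "333 ends at offset 9".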

The regex can also be anchored at the current position with the \G assertion, much like ^ anchors a regex at the start of the string.
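The effect of \G is easiest to see next to an unanchored match. In this sketch (helper names and input are made up for the example), the anchored loop stops at the first thing that is not a number, while the unanchored match skips over it:

```perl
use strict;
use warnings;

# Hypothetical helper: read comma-separated numbers from the start of the
# string. \G forces each match to begin exactly where the previous one
# ended, so the loop stops at the first non-number.
sub leading_numbers {
    my ($str) = @_;
    my @nums;
    while ($str =~ /\G(\d+),?/gc) {
        push @nums, $1;
    }
    return @nums;
}

# Hypothetical helper: without \G, the regex engine is free to skip
# ahead over anything that does not match.
sub all_numbers {
    my ($str) = @_;
    my @nums = $str =~ /(\d+)/g;
    return @nums;
}

my $input = '10,20,x,30';
print "with \\G:    ", join(' ', leading_numbers($input)), "\n";
print "without \\G: ", join(' ', all_numbers($input)), "\n";
```

With \G the result should be "10 20", and without it "10 20 30".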

This m//gc-style matching makes it easy to write a tokenizer:

my @tokens;
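$_ = "1, abc, 2 ";  # plain assignment: lexical "my $_" was removed in Perl 5.24
TOKEN: while ((pos($_) // 0) < length($_)) {  # pos is undef before the first match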
  /\G\s+/gc and next; # skip whitespace
  # if one of the following matches fails, the next token is tried
  if    (/\G(\d+)/gc) { push @tokens, [NUM => $1]}
  elsif (/\G,/gc    ) { push @tokens, ['COMMA'  ]}
  elsif (/\G(\w+)/gc) { push @tokens, [STR => $1]}
  else { last TOKEN } # break the loop only if nothing matched at this position.
}
say "[@$_]" for @tokens;

Output:

[NUM 1]
[COMMA]
[STR abc]
[COMMA]
[NUM 2]
amon answered Sep 23 '22 10:09