
Global regex match in while() on backtick result

Tags:

regex

perl

This script searches for words and prints them, rereading the source file on each iteration:

# cat mm.pl
#!/usr/bin/perl
use strict;
use warnings;

while( `cat aa` =~ /(\w+)/g ) {
    print "$1\n";
}

Input file:

# cat aa
aa
bb
cc

Result:

# ./mm.pl
aa
bb
cc

Please explain why the script doesn't run forever.

On every iteration of the while loop, the regex engine's offset should be reset, because the expression being matched has changed (a new cat is forked).

I thought Perl did some kind of caching of the cat result, but strace shows that cat was spawned 4 times (3 for the 3 lines + 1 for the false while condition):

# strace -f ./mm.pl 2>&1 | grep cat | grep -v ENOENT
[pid 22604] execve("/bin/cat", ["cat", "aa"], [/* 24 vars */] <unfinished ...>
[pid 22605] execve("/bin/cat", ["cat", "aa"], [/* 24 vars */] <unfinished ...>
[pid 22606] execve("/bin/cat", ["cat", "aa"], [/* 24 vars */] <unfinished ...>
[pid 22607] execve("/bin/cat", ["cat", "aa"], [/* 24 vars */] <unfinished ...>

On the other hand, the following example runs forever:

# cat kk.pl
#!/usr/bin/perl
use strict;
use warnings;

my $d = 'aaa';
while( $d =~ /(\w+)/g ) {
    print "$1\n";
    $d = 'aaa';
}

What is the difference between the two scripts? What am I missing?

MrCricket asked Apr 06 '17 13:04


1 Answer

The position at which //g left off is stored in magic added to the scalar against which the matching was performed.

$ perl -MDevel::Peek -e'$_ = "abc"; Dump($_); /./g; Dump($_);'
SV = PV(0x32169a0) at 0x3253ee0
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x323bae0 "abc"\0
  CUR = 3
  LEN = 10
  COW_REFCNT = 1
SV = PVMG(0x326c040) at 0x3253ee0
  REFCNT = 1
  FLAGS = (SMG,POK,IsCOW,pPOK)
  IV = 0
  NV = 0
  PV = 0x323bae0 "abc"\0
  CUR = 3
  LEN = 10
  COW_REFCNT = 2
  MAGIC = 0x323d050
    MG_VIRTUAL = &PL_vtbl_mglob
    MG_TYPE = PERL_MAGIC_regex_global(g)
    MG_FLAGS = 0x40
      BYTES
    MG_LEN = 1
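
The stored position can also be observed from Perl code with the built-in pos() function, without looking at the internals. The following is a minimal sketch, not part of the original answer; the variable names are only illustrative. It records pos() after each successful //g match and checks that the position is cleared once the match fails.

```perl
use strict;
use warnings;

my $s = "abc";
my @positions;

# Each successful //g match advances the position stored on $s.
while ( $s =~ /./g ) {
    push @positions, pos($s);
}

# A failed //g match (without /c) resets the position to undef.
my $after = defined pos($s) ? "defined" : "undef";

print "@positions, then $after\n";   # prints "1 2 3, then undef"
```

The first recorded value, 1, corresponds to MG_LEN = 1 in the dump above.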

This means the behaviour observed in the backticks example is only possible if the match operator matched against the same scalar all four times it was evaluated! How is that possible? It's because backticks is one of the operators that use a TARG.

Creating a scalar is relatively expensive since it requires up to three memory allocations! In order to increase performance, a scalar called TARG is associated with each instance of some operators. When an operator with a TARG is evaluated, it may populate the TARG with the value to return and return the TARG (rather than allocating and returning a new scalar).

"So what?", you might ask. After all, you've already demonstrated that assigning to a scalar resets the match position associated with that scalar. That's what's supposed to happen, but it doesn't for backticks.

Magic not only allows information to be attached to a variable, it also attaches functions to be called under certain conditions. The magic added by //g attaches a function that should be called after the scalar is modified (which is indicated by the SMG flag in the dump above). This function is what clears the position when a value is assigned to the scalar.
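
The effect of that function on an ordinary variable can be sketched with a bounded version of the question's kk.pl. This is an illustration only: the iteration guard and the @seen array are additions for the sake of the example, since without the guard the loop never ends.

```perl
use strict;
use warnings;

my $d = 'aaa';
my @seen;
my $iterations = 0;

while ( $d =~ /(\w+)/g ) {
    push @seen, pos($d);                              # 3: the match consumed "aaa"
    $d = 'aaa';                                       # assignment triggers the set magic
    push @seen, defined pos($d) ? pos($d) : 'undef';  # position has been cleared
    last if ++$iterations >= 3;                       # guard added for this sketch
}

print "@seen\n";   # prints "3 undef 3 undef 3 undef"
```

Because the position is cleared on every assignment, each evaluation of the condition starts matching from the beginning of the string again.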

The assignment operator handles the magic properly, but the backticks operator does not. It doesn't expect magic to have been added to its TARG, so it doesn't check whether there is any, and the function that clears the match position goes uncalled.
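
One way to avoid depending on this behaviour is to run the command once, store its output in a named variable, and match against that variable. The sketch below is a suggestion rather than part of the original answer. It uses `echo aa bb cc` as a stand-in for `cat aa` (an assumption, so that it runs without the input file) and assumes a POSIX shell is available.

```perl
use strict;
use warnings;

# Stand-in for `cat aa`; assumes a POSIX shell with echo (hypothetical input).
my $output = `echo aa bb cc`;

# The match position is now stored on $output, an ordinary variable,
# and the command is executed only once.
my @words;
while ( $output =~ /(\w+)/g ) {
    push @words, $1;
}

print "$_\n" for @words;
```

With this form the loop ends because the match eventually fails on $output, not because of how the backticks operator reuses its TARG.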

ikegami answered Oct 18 '22 15:10