
Regex and the /o modifier in Perl v5.10.1

Tags:

regex

perl

I am trying to optimize a script that runs regexes against every file in a specific directory tree. All of the components are working as they should, but I am trying to get the regexes to run as fast as possible.

The script is running many regexes on each file. We are trying to do something like this:

We are starting with a YAML file:

---
-
  description: has foo
  regex: foo
-
  description: has bar
  regex: bar
-
  description: has foofoo
  regex: foofoo
-
  description: has barbar
  regex: barbar

Then we read the file into an array (and run the regex strings through qr// to compile them) like this:

my @regex = @{LoadFile('yaml_file')};
foreach ( @regex ) { $_->{'regex'} = qr/$_->{'regex'}/ } 

Then we evaluate each of the regexes against every file like this:

foreach my $r ( @regex ) {
    if ( $slurped_file_text =~ /$r->{'regex'}/ ){
        stuff;
    }
}

What we have found is that the above method is much slower than a hand-expanded chain of if/elsif statements like this:

if( $slurped_file_text =~ /foo/ ){
    stuff;
}elsif( $slurped_file_text =~ /bar/ ){
    stuff;
}elsif( $slurped_file_text =~ /foofoo/ ){
    stuff;
}elsif( $slurped_file_text =~ /barbar/ ){
    stuff;
}

But this if/elsif method is not DRY, and we need to be able to add regexes to our list without editing the script code every time.

The NYTProf profile of the foreach version showed that a significant amount of time was being spent in main::CORE::regcomp.

After reading up on similar issues, we found the /o modifier, which is supposed to signal that the regex has not changed since compilation, so it doesn't need to be recompiled. So we tried this (basically just adding /o to the loop above):

foreach my $r ( @regex ) {
    if ( $slurped_file_text =~ /$r->{'regex'}/o ){
        stuff;
    }
}

And this gave us the speed we desired but it is not evaluating the regex correctly. It is not returning true when matching patterns exist.

I know that the /o modifier is rarely used anymore but, as stated above, we are still on perl v5.10.1, and the documentation for this version suggests that /o is needed to get the performance we are looking for.
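The symptom described above can be reproduced in isolation. The following is a minimal sketch, not the original script: the patterns and the sample text are made up for illustration. It relies on the documented behaviour of /o, which makes a match operator interpolate and compile its pattern only the first time that operator runs, so every later loop iteration silently reuses the first regex.

```perl
use strict;
use warnings;

# Hypothetical patterns and text, chosen so that only the second pattern matches.
my @patterns = map { qr/$_/ } qw(foo bar);
my $text     = 'only bar here';

my @plain;     # results from a normal match
my @with_o;    # results from the same match with /o

for my $re (@patterns) {
    # Without /o the current value of $re is used on every iteration.
    push @plain, ( $text =~ /$re/ ? 1 : 0 );

    # With /o the pattern is fixed the first time this operator executes,
    # so the second iteration still tests "foo", not "bar".
    push @with_o, ( $text =~ /$re/o ? 1 : 0 );
}

print "plain:  @plain\n";
print "with o: @with_o\n";
```

If this is what is happening in the real script, /o is the wrong tool for a loop over different patterns; it only suits a pattern that is the same on every execution of that match operator.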

My questions are these:

  • How can we get the regex with the /o modifier to evaluate correctly? Or is there anything you know about /o that might explain what is going on here?
  • Do you see any more efficient way of running a dynamic list of regexes against a set of files?

Any and all help is very much appreciated.

asked Nov 20 '25 05:11 by nmajor


1 Answer

The loop version is slower than the unrolled version because the loop always checks every regex, whereas the unrolled if/elsif chain stops checking as soon as one pattern matches.

my @regexs = map $_->{regex}, @{ LoadFile('yaml_file') };

for my $text (@texts) {
   for my $regex (@regexs) {
      if ($text =~ $regex) {
         stuff;
         last;       # <---- this was missing
      }
   }
}
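A self-contained sketch of the early-exit loop follows. It makes a few assumptions: the @rules array is a hard-coded stand-in for what LoadFile('yaml_file') would return, first_match is a hypothetical helper name, and the sample texts are invented. The patterns are precompiled with qr// and matched as `$text =~ $rule->{regex}`, without wrapping them in another /.../.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the structure LoadFile('yaml_file') would return.
my @rules = (
    { description => 'has foo',    regex => 'foo' },
    { description => 'has bar',    regex => 'bar' },
    { description => 'has foofoo', regex => 'foofoo' },
    { description => 'has barbar', regex => 'barbar' },
);

# Compile each pattern once, up front.
$_->{regex} = qr/$_->{regex}/ for @rules;

# Return the description of the first rule that matches, or undef.
# Returning from the loop plays the role of "last" in the loop above.
sub first_match {
    my ($text) = @_;
    for my $rule (@rules) {
        return $rule->{description} if $text =~ $rule->{regex};
    }
    return;
}

my @hits = map { first_match($_) // 'no match' }
    ( 'a barbar b', 'xfoofoo', 'nothing' );
print "$_\n" for @hits;
```

As with the if/elsif chain, order matters: any text containing "foofoo" also contains "foo", so the earlier rule wins.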

But since you don't appear to care which pattern matched, you should just build one pattern and compile it.

my $pattern = join '|', map "(?:$_->{regex})", @{ LoadFile('yaml_file') };
my $regex = qr/$pattern/;

for my $text (@texts) {
   if ($text =~ $regex) {
      stuff;
   }
}
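The same idea as a runnable sketch, under the assumption that the pattern strings have already been pulled out of the YAML file; the @sources list and the sample texts are hypothetical. Each source pattern is wrapped in (?:...) so that an alternation inside one pattern cannot leak into its neighbours.

```perl
use strict;
use warnings;

# Hypothetical pattern strings, as they might come out of the YAML file.
my @sources = qw(foo bar foofoo barbar);

# Join everything into one alternation and compile it exactly once.
my $pattern = join '|', map { "(?:$_)" } @sources;
my $regex   = qr/$pattern/;

my @texts   = ( 'has foo in it', 'nothing here', 'barbar' );
my @matched = grep { $_ =~ $regex } @texts;

print scalar(@matched), " of ", scalar(@texts), " texts matched\n";
```

This only reports whether any pattern matched. If the script needs to know which one did, the early-exit loop is the simpler fit.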
answered Nov 22 '25 01:11 by ikegami


