Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Perl anchored regex performance

Problem and Data

At the bottom of this post is the entire script from which this NYTProf data was generated. The script builds a hash and then attempts to delete keys that contain certain bad pattern. Running the code through NYTProf generates the following

delete @$hash{ grep { /\Q$bad_pattern\E/ } sort keys %$hash };
# spent  7.29ms making    2 calls to main::CORE:sort, avg 3.64ms/call
# spent   808µs making 7552 calls to main::CORE:match, avg 107ns/call
# spent   806µs making 7552 calls to main::CORE:regcomp, avg 107ns/call

There are over 7,000 calls being made to main::CORE:match and main::CORE:regcomp. The assumption is that this is a sufficient amount of calls to reduce noise levels.

Moving on! The bad patterns only need to be deleted if they appear at the beginning of a key. Sounds great! Adding a ^ to anchor the regex should improve performance. However, NYTProf generates the following. NYTprof has been run many times and this is quite consistent

delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
# spent  7.34ms making    2 calls to main::CORE:sort, avg 3.67ms/call
# spent  1.62ms making 7552 calls to main::CORE:regcomp, avg 214ns/call
# spent   723µs making 7552 calls to main::CORE:match, avg 96ns/call

Questions

The anchored regex nearly doubles the amount of time spent in these main::CORE:* methods. But an anchored regex should improve performance. What is unique about this dataset that causes the anchored regex to take so much additional time?

Entire Script

use strict;
use Devel::NYTProf;

my @states = qw(KansasCity MississippiState ColoradoMountain IdahoInTheNorthWest AnchorageIsEvenFurtherNorth);
my @cities = qw(WitchitaHouston ChicagoDenver);
my @streets = qw(DowntownMainStreetInTheCity CenterStreetOverTheHill HickoryBasketOnTheWall);
my @seasoncode = qw(8000S 8000P 8000F 8000W);
my @historycode = qw(7000S 7000P 7000F 7000W 7000A 7000D 7000G 7000H);
my @sides = qw(left right up down);

my $hash;
for my $state (@states) {
    for my $city (@cities) {
        for my $street (@streets) {
            for my $season (@seasoncode) {
                for my $history (@historycode) {
                    for my $side (@sides) {
                        $hash->{$state . '[0].' . $city . '[1].' . $street . '[2].' . $season . '.' . $history . '.' . $side} = 1;
                    }
                }
            }
        }
    }
}

sub CleanseHash {
    my @bad_patterns = (
        'KansasCity[0].WitchitaHouston[1].DowntownMainStreetInTheCity[2]',
        'ColoradoMountain[0].ChicagoDenver[1].HickoryBasketOnTheWall[2].8000F'
    );

    for my $bad_pattern (@bad_patterns) {
        delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
    }
}

DB::enable_profile();
CleanseHash();
DB::finish_profile();
like image 320
user1333371 Avatar asked Dec 18 '22 11:12

user1333371


1 Answers

It's very unlikely you can optimise the regex engine. If performance is your goal, though, you can concentrate on other parts of the code. For example, try this:

for my $bad_pattern (@bad_patterns) {
    my $re = qr/^\Q$bad_pattern\E/;
    delete @$hash{ grep /$re/, sort keys %$hash };
}

On my machine, it runs much faster (regardless of the presence of the anchor), because the expression form of grep doesn't have to create a scope and the complex compilation of the regex happens just once for each bad pattern.

like image 93
choroba Avatar answered Dec 24 '22 04:12

choroba