Perl anchored regex performance

Question

Problem and Data

At the bottom of this post is the entire script from which this NYTProf data was generated. The script builds a hash and then attempts to delete keys that contain certain bad pattern. Running the code through NYTProf generates the following

delete @$hash{ grep { /\Q$bad_pattern\E/ } sort keys %$hash };
# spent  7.29ms making    2 calls to main::CORE:sort, avg 3.64ms/call
# spent   808µs making 7552 calls to main::CORE:match, avg 107ns/call
# spent   806µs making 7552 calls to main::CORE:regcomp, avg 107ns/call

There are over 7,000 calls being made to main::CORE:match and main::CORE:regcomp. The assumption is that this is a sufficient amount of calls to reduce noise levels.

Moving on! The bad patterns only need to be deleted if they appear at the beginning of a key. Sounds great! Adding a ^ to anchor the regex should improve performance. However, NYTProf generates the following. NYTprof has been run many times and this is quite consistent

delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
# spent  7.34ms making    2 calls to main::CORE:sort, avg 3.67ms/call
# spent  1.62ms making 7552 calls to main::CORE:regcomp, avg 214ns/call
# spent   723µs making 7552 calls to main::CORE:match, avg 96ns/call

Questions

The anchored regex nearly doubles the amount of time spent in these main::CORE:* methods. But an anchored regex should improve performance. What is unique about this dataset that causes the anchored regex to take so much additional time?

Entire Script

use strict;
use Devel::NYTProf;

my @states = qw(KansasCity MississippiState ColoradoMountain IdahoInTheNorthWest AnchorageIsEvenFurtherNorth);
my @cities = qw(WitchitaHouston ChicagoDenver);
my @streets = qw(DowntownMainStreetInTheCity CenterStreetOverTheHill HickoryBasketOnTheWall);
my @seasoncode = qw(8000S 8000P 8000F 8000W);
my @historycode = qw(7000S 7000P 7000F 7000W 7000A 7000D 7000G 7000H);
my @sides = qw(left right up down);

my $hash;
for my $state (@states) {
    for my $city (@cities) {
        for my $street (@streets) {
            for my $season (@seasoncode) {
                for my $history (@historycode) {
                    for my $side (@sides) {
                        $hash->{$state . '[0].' . $city . '[1].' . $street . '[2].' . $season . '.' . $history . '.' . $side} = 1;
                    }
                }
            }
        }
    }
}

sub CleanseHash {
    my @bad_patterns = (
        'KansasCity[0].WitchitaHouston[1].DowntownMainStreetInTheCity[2]',
        'ColoradoMountain[0].ChicagoDenver[1].HickoryBasketOnTheWall[2].8000F'
    );

    for my $bad_pattern (@bad_patterns) {
        delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
    }
}

DB::enable_profile();
CleanseHash();
DB::finish_profile();

choroba · Accepted Answer

It's very unlikely you can optimise the regex engine. If performance is your goal, though, you can concentrate on other parts of the code. For example, try this:

for my $bad_pattern (@bad_patterns) {
    my $re = qr/^\Q$bad_pattern\E/;
    delete @$hash{ grep /$re/, sort keys %$hash };
}

On my machine, it runs much faster (regardless of the presence of the anchor), because the expression form of grep doesn't have to create a scope and the complex compilation of the regex happens just once for each bad pattern.

Perl anchored regex performance

Tags:

performance

regex

profiling

perl

user1333371

1 Answers

choroba

Recent Activity

Donate For Us

Perl anchored regex performance

Tags:

performance

regex

profiling

perl

user1333371

1 Answers

choroba

Related questions

Recent Activity

Donate For Us