Problem and Data
At the bottom of this post is the entire script from which this NYTProf data was generated. The script builds a hash and then attempts to delete keys that contain certain bad pattern. Running the code through NYTProf generates the following
delete @$hash{ grep { /\Q$bad_pattern\E/ } sort keys %$hash };
# spent 7.29ms making 2 calls to main::CORE:sort, avg 3.64ms/call
# spent 808µs making 7552 calls to main::CORE:match, avg 107ns/call
# spent 806µs making 7552 calls to main::CORE:regcomp, avg 107ns/call
There are over 7,000 calls being made to main::CORE:match and main::CORE:regcomp. The assumption is that this is a sufficient amount of calls to reduce noise levels.
Moving on! The bad patterns only need to be deleted if they appear at the beginning of a key. Sounds great! Adding a ^ to anchor the regex should improve performance. However, NYTProf generates the following. NYTprof has been run many times and this is quite consistent
delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
# spent 7.34ms making 2 calls to main::CORE:sort, avg 3.67ms/call
# spent 1.62ms making 7552 calls to main::CORE:regcomp, avg 214ns/call
# spent 723µs making 7552 calls to main::CORE:match, avg 96ns/call
Questions
The anchored regex nearly doubles the amount of time spent in these main::CORE:* methods. But an anchored regex should improve performance. What is unique about this dataset that causes the anchored regex to take so much additional time?
Entire Script
use strict;
use Devel::NYTProf;
my @states = qw(KansasCity MississippiState ColoradoMountain IdahoInTheNorthWest AnchorageIsEvenFurtherNorth);
my @cities = qw(WitchitaHouston ChicagoDenver);
my @streets = qw(DowntownMainStreetInTheCity CenterStreetOverTheHill HickoryBasketOnTheWall);
my @seasoncode = qw(8000S 8000P 8000F 8000W);
my @historycode = qw(7000S 7000P 7000F 7000W 7000A 7000D 7000G 7000H);
my @sides = qw(left right up down);
my $hash;
for my $state (@states) {
for my $city (@cities) {
for my $street (@streets) {
for my $season (@seasoncode) {
for my $history (@historycode) {
for my $side (@sides) {
$hash->{$state . '[0].' . $city . '[1].' . $street . '[2].' . $season . '.' . $history . '.' . $side} = 1;
}
}
}
}
}
}
sub CleanseHash {
my @bad_patterns = (
'KansasCity[0].WitchitaHouston[1].DowntownMainStreetInTheCity[2]',
'ColoradoMountain[0].ChicagoDenver[1].HickoryBasketOnTheWall[2].8000F'
);
for my $bad_pattern (@bad_patterns) {
delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
}
}
DB::enable_profile();
CleanseHash();
DB::finish_profile();
It's very unlikely you can optimise the regex engine. If performance is your goal, though, you can concentrate on other parts of the code. For example, try this:
for my $bad_pattern (@bad_patterns) {
my $re = qr/^\Q$bad_pattern\E/;
delete @$hash{ grep /$re/, sort keys %$hash };
}
On my machine, it runs much faster (regardless of the presence of the anchor), because the expression form of grep doesn't have to create a scope and the complex compilation of the regex happens just once for each bad pattern.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With