Why does adding one more alternative make my regex over 600 times slower?

Tags:

I noticed something weird while testing a simple Perl script that's supposed to filter out filenames beginning with certain prefixes.

Basically, I'm constructing a regex like this:

my $regex = join "|", map quotemeta, @prefixes; $regex = qr/^($regex)/;   # anchor the regex and precompile it

Here, in the scenario I originally tested, @prefixes consists of 32-character hexadecimal strings. What I found is that everything ran nice and smoothly up to 6,552 prefixes — but as soon as I tried 6,553, the execution time of the script jumped by a factor of over 25 (from 1.8 seconds to 51 seconds)!

I played around with it, and constructed the following benchmark. I originally used 32-character hex strings, like in my original program, but found that if I cut the length of the strings down to just 8 characters, the threshold value moved higher — to 16,383, in fact — while the slowdown factor got dramatically higher yet: the regexp with 16,383 alternatives is almost 650 times slower than the one with only 16,382!

#!/usr/bin/perl use strict; use warnings; use Benchmark qw(timethese cmpthese);  my $count = shift || 10;  our @items = map unpack("H8", pack "V", $_), 0..99999;  our $nA = 16382; our $reA = join "|", @items[1 .. $nA]; our $nB = 16383; our $reB = join "|", @items[1 .. $nB];  $_ = qr/^($_)/ for $reA, $reB;  # anchor and compile regex  my $results = timethese( $count, {     $nA => q{ our (@items, $nA, $reA); $nA == grep /$reA/, @items or die; },     $nB => q{ our (@items, $nB, $reB); $nB == grep /$reB/, @items or die; }, } ); cmpthese( $results );

Here are the results of running this benchmark on my laptop, using Perl (v5.18.2):

Benchmark: timing 10 iterations of 16382, 16383...      16382:  2 wallclock secs ( 1.79 usr +  0.00 sys =  1.79 CPU) @  5.59/s (n=10)      16383: 1163 wallclock secs (1157.28 usr +  2.70 sys = 1159.98 CPU) @  0.01/s (n=10)       s/iter  16383  16382 16383    116     --  -100% 16382  0.179 64703%     --

Note the 64,703% speed difference.

My original hunch, based on the numerical coincidence that 6553 ≈ 2¹⁶ / 10, was that this might've been some kind of an arbitrary limit within the Perl regex engine, or maybe that there there might be some kind of an array of 10-byte structs that was limited to 64 kB, or something. But the fact that the threshold number of alternatives depends on their length makes things more confusing.

(On the other hand, it's clearly not just about the length of the regex, either; the original regex with 6,553 32-byte alternatives was 2 + 6,553×33 = 216,251 bytes long, while the one with 16,383 8-byte alternatives is only 2 + 16,383×9 = 147,450 bytes long.)

What is causing this weird jump in regex matching time, and why does it happen at that specific point?

324

asked Mar 23 '15 18:03

Ilmari Karonen

1 Answers

For a long time, perl's TRIE optimization has not been applied where the initial compilation of the regex produces longjmp instead of jmp (I think) instructions (which depends on the number of alternations and the total lengths of the strings involved and what else is (earlier?) in the regex).

See the difference between:

perl -we'use re "debug"; qr/@{[join"|","a".."afhd"]}/'

and

perl -we'use re "debug"; qr/@{[join"|","a".."afhe"]}/'

You can break your alternation down into smaller chunks and precompile them separately and do e.g. (??{$rxa})|(??{$rxb})|(??{$rxc}).

183

answered Oct 02 '22 19:10

ysth

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why does adding one more alternative make my regex over 600 times slower?

Tags:

Ilmari Karonen

People also ask

1 Answers

ysth

Recent Activity

Donate For Us

Why does adding one more alternative make my regex over 600 times slower?

Tags:

Ilmari Karonen

People also ask

1 Answers

ysth

Related questions

Recent Activity

Donate For Us