Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does adding one more alternative make my regex over 600 times slower?

Tags:

I noticed something weird while testing a simple Perl script that's supposed to filter out filenames beginning with certain prefixes.

Basically, I'm constructing a regex like this:

my $regex = join "|", map quotemeta, @prefixes; $regex = qr/^($regex)/;   # anchor the regex and precompile it 

Here, in the scenario I originally tested, @prefixes consists of 32-character hexadecimal strings. What I found is that everything ran nice and smoothly up to 6,552 prefixes — but as soon as I tried 6,553, the execution time of the script jumped by a factor of over 25 (from 1.8 seconds to 51 seconds)!

I played around with it, and constructed the following benchmark. I originally used 32-character hex strings, like in my original program, but found that if I cut the length of the strings down to just 8 characters, the threshold value moved higher — to 16,383, in fact — while the slowdown factor got dramatically higher yet: the regexp with 16,383 alternatives is almost 650 times slower than the one with only 16,382!

#!/usr/bin/perl use strict; use warnings; use Benchmark qw(timethese cmpthese);  my $count = shift || 10;  our @items = map unpack("H8", pack "V", $_), 0..99999;  our $nA = 16382; our $reA = join "|", @items[1 .. $nA]; our $nB = 16383; our $reB = join "|", @items[1 .. $nB];  $_ = qr/^($_)/ for $reA, $reB;  # anchor and compile regex  my $results = timethese( $count, {     $nA => q{ our (@items, $nA, $reA); $nA == grep /$reA/, @items or die; },     $nB => q{ our (@items, $nB, $reB); $nB == grep /$reB/, @items or die; }, } ); cmpthese( $results ); 

Here are the results of running this benchmark on my laptop, using Perl (v5.18.2):

Benchmark: timing 10 iterations of 16382, 16383...      16382:  2 wallclock secs ( 1.79 usr +  0.00 sys =  1.79 CPU) @  5.59/s (n=10)      16383: 1163 wallclock secs (1157.28 usr +  2.70 sys = 1159.98 CPU) @  0.01/s (n=10)       s/iter  16383  16382 16383    116     --  -100% 16382  0.179 64703%     -- 

Note the 64,703% speed difference.

My original hunch, based on the numerical coincidence that 6553 ≈ 216 / 10, was that this might've been some kind of an arbitrary limit within the Perl regex engine, or maybe that there there might be some kind of an array of 10-byte structs that was limited to 64 kB, or something. But the fact that the threshold number of alternatives depends on their length makes things more confusing.

(On the other hand, it's clearly not just about the length of the regex, either; the original regex with 6,553 32-byte alternatives was 2 + 6,553×33 = 216,251 bytes long, while the one with 16,383 8-byte alternatives is only 2 + 16,383×9 = 147,450 bytes long.)

What is causing this weird jump in regex matching time, and why does it happen at that specific point?

like image 324
Ilmari Karonen Avatar asked Mar 23 '15 18:03

Ilmari Karonen


People also ask

How can I Make my regex fail faster?

If you start with the one that is least likely to match, the regex will fail faster. This one is a bit of a micro-optimization, but it can give you a decent boost depending on your use case, and it can’t possibly hurt because the two expressions are equivalent. I ran a benchmark on the following two regexes:

How much better is regex for performance?

Regex 1 (the .* one) Regex 2 (the character class one) Performance improvement Input 1 (matching) 3606ms 736ms 79.6% Input 2 (not matching) 591ms 225ms 61.9% Input 3 (almost matching) 2520ms 597ms 76.3% Here we can see that even with matching input, the vague dot starry regex takes way longer. In all cases, the specific regex performed way better.

How to reduce processing time with regular expression techniques?

Without further ado, here are five regular expression techniques that can dramatically reduce processing time: Character classes Possessive quantifiers (and atomic groups) Lazy quantifiers Anchors and boundaries Optimizing regex order Character Classes This is the most important thing to keep in mind when crafting performant regexes.

Does adding more terms to a linear model increase the R-squared?

Many statistics textbooks state that adding more terms into a linear model always reduces the sum of squares and in turn increases the r-squared value. This has led to the use of the adjusted r-squared.


1 Answers

For a long time, perl's TRIE optimization has not been applied where the initial compilation of the regex produces longjmp instead of jmp (I think) instructions (which depends on the number of alternations and the total lengths of the strings involved and what else is (earlier?) in the regex).

See the difference between:

perl -we'use re "debug"; qr/@{[join"|","a".."afhd"]}/' 

and

perl -we'use re "debug"; qr/@{[join"|","a".."afhe"]}/' 

You can break your alternation down into smaller chunks and precompile them separately and do e.g. (??{$rxa})|(??{$rxb})|(??{$rxc}).

like image 183
ysth Avatar answered Oct 02 '22 19:10

ysth