Backtracking in regexp faster than expected

Tags:

According to perlre, the following code should take several seconds to execute:

$ time perl -E '$_="((()" . ("a") x 18;  say "ok" if m{ \(([^()]+|\( [^()]* \))+\)}x;'

real    0m0.006s
user    0m0.000s
sys 0m0.005s

The documentation says:

Consider how the pattern above detects no-match on ((()aaaaaaaaaaaaaaaaaa in several seconds, but that each extra letter doubles this time.

As seen, it only takes a fraction of a second on my laptop.. Even running with a million a's completes in half a second:

$ time perl -E '$_="((()" . ("a") x 1000000;  say "ok" if m{ \(([^()]+|\( [^()]* \))+\)}x;'

real    0m0.454s
user    0m0.446s
sys 0m0.008s

What am I missing here?

748

asked Aug 20 '15 07:08

Håkon Hægland

2 Answers

One of the tricks you can do to figure out what the regex engine is doing, is:

use re 'debug';

E.g.:

use strict;
use warnings;
use re 'debug'; 

my $str = "a" x 18;

$str =~ m{ \(([^()]+|\( [^()]* \))+\)}x;

This will then print what the regex engine is actually doing:

Compiling REx " \(([^()]+|\( [^()]* \))+\)"
Final program:
   1: EXACT <(> (3)
   3: CURLYX[0] {1,32767} (40)
   5:   OPEN1 (7)
   7:     BRANCH (20)
   8:       PLUS (37)
   9:         ANYOF[^()][{above_bitmap_all}] (0)
  20:     BRANCH (FAIL)
  21:       EXACT <(> (23)
  23:       STAR (35)
  24:         ANYOF[^()][{above_bitmap_all}] (0)
  35:       EXACT <)> (37)
  37:   CLOSE1 (39)
  39: WHILEM[1/3] (0)
  40: NOTHING (41)
  41: EXACT <)> (43)
  43: END (0)
anchored "(" at 0 floating ")" at 2..2147483647 (checking floating) minlen 3 
Matching REx " \(([^()]+|\( [^()]* \))+\)" against "aaaaaaaaaaaaaaaaaa"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [2..18] gave -1
  Did not find floating substr ")"...
Match rejected by optimizer
Freeing REx: " \(([^()]+|\( [^()]* \))+\)"

If you add your brackets back in, you get a different result - I get around 2000 steps to process the regex. This gets longer with each additional letter - about 300 steps.

So I would say yes - catastrophic backtracking is occurring, but you may well find that processors (and regex engine optimisations) mean the time is considerably shorter.

use strict;
use warnings;
use re 'debug'; 

my $str = "((()"."a" x 100_000;
$str =~ m{ \(([^()]+|\( [^()]* \))+\)}x;

Runs considerably longer - but at least a proportion of it is because printing text to the screen is actually quite 'expensive' comparatively.

My test suggests (number of "a"s)

10 : 1221 lines of output (broadly correlated with regex steps)
11 : 1324 (+103)
12 : 1467 (+143)
13 : 1590 (+129)
14 : 1728 (+138)
15 : 1852 (+124)

20 : 2630 (approx +155/extra)
30 : 4536 (+190/extra)
40 : 6940 (+240/extra)
50 : 9846 (+290/extra)

100 - 31,846 (+440/extra letter)

So certainly looks like exponential behaviour going on - but in neither case, the processing time is still pretty quick, which I'd attribute to faster cpus (and maybe better optimisation of the regex engine)

194

answered Oct 17 '22 04:10

Sobrique

Consider how the pattern above detects no-match on ((()aaaaaaaaaaaaaaaaaa in several seconds.

This sentence dates back to at least April 2001, when perl 5.6.1 was released. You can see the perlre man page for that release at search.cpan.org/~gsar/perl-5.6.1/pod/perlre.pod.

This is perhaps a lesson to be learned here in documenting software. Be careful what you write as your written words will remain unchanged despite years of improvements and multiple forks by others.

answered Oct 17 '22 04:10

David Hammen

Related questions
                            
                                SQLite in C and supporting REGEXP
                            
                                How to Replace Text while Maintaining Capitalization?
                            
                                Regex and escaped and unescaped delimiter
                            
                                Refactoring large/old CSS files
                            
                                Auto-link regular expression
                            
                                Regular expression to split long strings in several lines
                            
                                Are these regex patterns different?
                            
                                How to go to the last match of a Vim search pattern
                            
                                How to grep for a URL in a file?
                            
                                How many times one string contains another [duplicate]
                            
                                how to validate internationalized domain names [closed]
                            
                                python re.X vs automagic line continuation
                            
                                Can you search backwards from an offset using a Python regular expression?
                            
                                Regex only allow alphanumeric
                            
                                LIKE '[charlist]%' syntax not working in MySQL (phpMyAdmin)
                            
                                regular expressions : (.*), (.*?) and .* [duplicate]
                            
                                Can I use python 're' to parse complex human names?
                            
                                Regex: parsing GitHub usernames (JavaScript)
                            
                                highlight text with surrounding words
                            
                                How can one find an item after the match using streams?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Backtracking in regexp faster than expected

Tags:

regex

perl

backtracking

Håkon Hægland

People also ask

2 Answers

Sobrique

David Hammen

Recent Activity

Donate For Us