Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Backtracking in regexp faster than expected

According to perlre, the following code should take several seconds to execute:

$ time perl -E '$_="((()" . ("a") x 18;  say "ok" if m{ \(([^()]+|\( [^()]* \))+\)}x;'

real    0m0.006s
user    0m0.000s
sys 0m0.005s

The documentation says:

Consider how the pattern above detects no-match on ((()aaaaaaaaaaaaaaaaaa in several seconds, but that each extra letter doubles this time.

As seen, it only takes a fraction of a second on my laptop.. Even running with a million a's completes in half a second:

$ time perl -E '$_="((()" . ("a") x 1000000;  say "ok" if m{ \(([^()]+|\( [^()]* \))+\)}x;'

real    0m0.454s
user    0m0.446s
sys 0m0.008s

What am I missing here?

like image 748
Håkon Hægland Avatar asked Aug 20 '15 07:08

Håkon Hægland


People also ask

What is catastrophic backtracking in regex?

Catastrophic backtracking is a condition that can occur if you're checking a (usually long) string against a complex regular expression. The problem usually occurs if something towards the end of the string causes the string to not match.

Are Regexes slow?

My experience shows that most of the time developers focus on correctness of a regex, leaving aside its performance. Yet matching a string with a regex can be surprisingly slow. So slow it can even stop any JS app or take 100% of a server CPU time causing denial of service (DOS).

Is regular expression efficient?

While a well written regular expression can be very efficient, a badly written regular expression might take a long time to run and slow the system down significantly.

Why is regex expensive?

Poorly written regex patterns are not only costly, they are dangerous. This one has too many consequent subpatterns that can match empty strings. Thus, backtracking falls into them and fails, and re-tries, and fails... That makes your pattern costly.


2 Answers

One of the tricks you can do to figure out what the regex engine is doing, is:

use re 'debug'; 

E.g.:

use strict;
use warnings;
use re 'debug'; 

my $str = "a" x 18;

$str =~ m{ \(([^()]+|\( [^()]* \))+\)}x;

This will then print what the regex engine is actually doing:

Compiling REx " \(([^()]+|\( [^()]* \))+\)"
Final program:
   1: EXACT <(> (3)
   3: CURLYX[0] {1,32767} (40)
   5:   OPEN1 (7)
   7:     BRANCH (20)
   8:       PLUS (37)
   9:         ANYOF[^()][{above_bitmap_all}] (0)
  20:     BRANCH (FAIL)
  21:       EXACT <(> (23)
  23:       STAR (35)
  24:         ANYOF[^()][{above_bitmap_all}] (0)
  35:       EXACT <)> (37)
  37:   CLOSE1 (39)
  39: WHILEM[1/3] (0)
  40: NOTHING (41)
  41: EXACT <)> (43)
  43: END (0)
anchored "(" at 0 floating ")" at 2..2147483647 (checking floating) minlen 3 
Matching REx " \(([^()]+|\( [^()]* \))+\)" against "aaaaaaaaaaaaaaaaaa"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [2..18] gave -1
  Did not find floating substr ")"...
Match rejected by optimizer
Freeing REx: " \(([^()]+|\( [^()]* \))+\)"

If you add your brackets back in, you get a different result - I get around 2000 steps to process the regex. This gets longer with each additional letter - about 300 steps.

So I would say yes - catastrophic backtracking is occurring, but you may well find that processors (and regex engine optimisations) mean the time is considerably shorter.

use strict;
use warnings;
use re 'debug'; 

my $str = "((()"."a" x 100_000;
$str =~ m{ \(([^()]+|\( [^()]* \))+\)}x;

Runs considerably longer - but at least a proportion of it is because printing text to the screen is actually quite 'expensive' comparatively.

My test suggests (number of "a"s)

10 : 1221 lines of output (broadly correlated with regex steps)
11 : 1324 (+103)
12 : 1467 (+143)
13 : 1590 (+129)
14 : 1728 (+138)
15 : 1852 (+124)

20 : 2630 (approx +155/extra)
30 : 4536 (+190/extra)
40 : 6940 (+240/extra)
50 : 9846 (+290/extra)

100 - 31,846 (+440/extra letter)

So certainly looks like exponential behaviour going on - but in neither case, the processing time is still pretty quick, which I'd attribute to faster cpus (and maybe better optimisation of the regex engine)

like image 194
Sobrique Avatar answered Oct 17 '22 04:10

Sobrique


Consider how the pattern above detects no-match on ((()aaaaaaaaaaaaaaaaaa in several seconds.

This sentence dates back to at least April 2001, when perl 5.6.1 was released. You can see the perlre man page for that release at search.cpan.org/~gsar/perl-5.6.1/pod/perlre.pod.

This is perhaps a lesson to be learned here in documenting software. Be careful what you write as your written words will remain unchanged despite years of improvements and multiple forks by others.

like image 2
David Hammen Avatar answered Oct 17 '22 04:10

David Hammen