Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Stop runaway regular expression

Tags:

regex

perl

Is there a way to stop a runaway regular expression?

I am not interested in suggestions on how to modify it. I know it can be modified so it doesn't break, etc, but I am running a single regex against thousands of inputs, so modifying it means I need to retest it on all the inputs. Not very practical.

So the exact question is: is there some form of timer that I can use to terminate a regex that takes longer than X seconds to complete?

like image 213
user3688176 Avatar asked May 29 '14 15:05

user3688176


1 Answers

Perl's built-in alarm is insufficient for breaking out of a long running regular expression because Perl doesn't give opportunities for alarm timeouts inside of internal opcodes. alarm simply cannot penetrate it.

In some cases the most obvious solution is to fork a subprocess and time it out after it's gone on too long using alarm. This PerlMonks post demonstrates how to time-out a forked process: Re: Timeout on script

There is a Perl module on CPAN called Sys::SigAction that has a function called timeout_call, which will interrupt a long-running regular expression using unsafe signals. However, the RE engine wasn't designed to be interrupted, and can be left in an unstable state that may lead to seg-faults about 10% of the time.

Here is some example code that demonstrates Sys::SigAction successfully breaking out of the regex engine, as well as demonstrating that Perl's alarm is incapable of doing so:

use Sys::SigAction 'timeout_call';
use Time::HiRes;


sub run_re {
  my $string = ('a' x 64 ) . 'b';

  if( $string =~ m/(a*a*a*a*a*a*a*a*a*a*a*a*)*[^Bb]$/ ) {
    print "Whoops!\n";
  }
  else {
    print "Ok!\n";
  }
}

print "Sys::SigAction::timeout_call:\n";
my $t = time();
timeout_call(2,\&run_re);
print time() - $t, " seconds.\n";

print "alarm:\n";
$t = time();

eval {
  local $SIG{ALRM} = sub { die "alarm\n" };
  alarm 2;
  run_re();
  alarm 0;
};

if( $@ ) {
  die unless $@ eq "alarm\n";
}
else {
  print time() - $t, " seconds.\n";
}

The output will be something along the lines of:

$ ./mytest.pl
Sys::SigAction::timeout_call:
Complex regular subexpression recursion limit (32766) exceeded at ./mytest.pl line 11.
2 seconds.
alarm:
Complex regular subexpression recursion limit (32766) exceeded at ./mytest.pl line 11.
^C

You will notice that in the second call -- the one that is supposed to time out with alarm, I finally had to ctrl-C out of it because alarm was inadequate for breaking out of the RE engine.

The big caveat with Sys::SigAction is that even though it is capable of breaking out of a long-running regular expression, because the RE engine wasn't designed for such interruptions, the entire process can become unstable, leading to a segfault. While it doesn't happen every time, it can happen. This probably isn't what you want.

I don't know what your regular expression looks like, but if it fits within the syntax allowed by the RE2 engine, you can use the Perl module, re::engine::RE2 to work with the RE2 C++ library. This engine guarantees linear time searches, though it provides less powerful semantics than Perl's built-in engine. The RE2 approach avoids the whole issue in the first place by providing a linear-time assurance.

However, if you're unable to use RE2 (possibly because your regex's semantics are too demanding for it), the fork/alarm method is probably the safest way to assure you remain in control.

(By the way, this question, and a version of my answer were crossposted to PerlMonks.)

like image 200
DavidO Avatar answered Oct 13 '22 16:10

DavidO