Is there a way to stop a runaway regular expression?
I am not interested in suggestions on how to modify it. I know it can be modified so it doesn't break, etc, but I am running a single regex against thousands of inputs, so modifying it means I need to retest it on all the inputs. Not very practical.
So the exact question is: is there some form of timer that I can use to terminate a regex that takes longer than X seconds to complete?
Perl's built-in alarm
is insufficient for breaking out of a long running regular expression because Perl doesn't give opportunities for alarm timeouts inside of internal opcodes. alarm
simply cannot penetrate it.
In some cases the most obvious solution is to fork
a subprocess and time it out after it's gone on too long using alarm
. This PerlMonks post demonstrates how to time-out a forked process: Re: Timeout on script
There is a Perl module on CPAN called Sys::SigAction that has a function called timeout_call
, which will interrupt a long-running regular expression using unsafe signals. However, the RE engine wasn't designed to be interrupted, and can be left in an unstable state that may lead to seg-faults about 10% of the time.
Here is some example code that demonstrates Sys::SigAction successfully breaking out of the regex engine, as well as demonstrating that Perl's alarm
is incapable of doing so:
use Sys::SigAction 'timeout_call';
use Time::HiRes;
sub run_re {
my $string = ('a' x 64 ) . 'b';
if( $string =~ m/(a*a*a*a*a*a*a*a*a*a*a*a*)*[^Bb]$/ ) {
print "Whoops!\n";
}
else {
print "Ok!\n";
}
}
print "Sys::SigAction::timeout_call:\n";
my $t = time();
timeout_call(2,\&run_re);
print time() - $t, " seconds.\n";
print "alarm:\n";
$t = time();
eval {
local $SIG{ALRM} = sub { die "alarm\n" };
alarm 2;
run_re();
alarm 0;
};
if( $@ ) {
die unless $@ eq "alarm\n";
}
else {
print time() - $t, " seconds.\n";
}
The output will be something along the lines of:
$ ./mytest.pl
Sys::SigAction::timeout_call:
Complex regular subexpression recursion limit (32766) exceeded at ./mytest.pl line 11.
2 seconds.
alarm:
Complex regular subexpression recursion limit (32766) exceeded at ./mytest.pl line 11.
^C
You will notice that in the second call -- the one that is supposed to time out with alarm
, I finally had to ctrl-C
out of it because alarm
was inadequate for breaking out of the RE engine.
The big caveat with Sys::SigAction is that even though it is capable of breaking out of a long-running regular expression, because the RE engine wasn't designed for such interruptions, the entire process can become unstable, leading to a segfault. While it doesn't happen every time, it can happen. This probably isn't what you want.
I don't know what your regular expression looks like, but if it fits within the syntax allowed by the RE2 engine, you can use the Perl module, re::engine::RE2 to work with the RE2 C++ library. This engine guarantees linear time searches, though it provides less powerful semantics than Perl's built-in engine. The RE2 approach avoids the whole issue in the first place by providing a linear-time assurance.
However, if you're unable to use RE2 (possibly because your regex's semantics are too demanding for it), the fork/alarm method is probably the safest way to assure you remain in control.
(By the way, this question, and a version of my answer were crossposted to PerlMonks.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With