I have a growing list of regular expressions that I am using to parse through log files searching for "interesting" error and debug statements. I'm currently breaking them into 5 buckets, with most of them falling into 3 large buckets. I have over 140 patterns so far, and the list is continuing to grow.
Most of the regular expressions are simple, but they're also fairly distinct, so my opportunities to catch multiple matches with a single pattern are few and far between. Because of the nature of what I'm matching, the patterns tend to be obscure and therefore seldom matched, so I'm doing a TON of work on each input line only for it to fail to match anything, or to match one of the generic patterns at the very end.
And because of the quantity of input (hundreds of megabytes of log files) I'm sometimes waiting for a minute or two for the script to finish. Hence my desire for a more efficient solution. I'm not interested in sacrificing clarity for speed, though.
I currently have the regular expressions set up like this:
if (($line =~ m{Failed in routing out}) ||
($line =~ m{Agent .+ failed}) ||
($line =~ m{Record Not Exist in DB}) ||
...
Is there a better way of structuring this so it's more efficient, yet still maintainable? Thanks!
In general, Perl uses a backtracking regex engine. Such an engine is flexible, easy to implement, and very fast on a subset of regexes. For other kinds of regex, however, such as those with many alternatives joined by the | operator, it can become slow, because the engine may have to try each alternative in turn at every position.
Chaining regular expressions: Regular expressions can be chained together using the pipe character (|). This allows for multiple search options to be acceptable in a single regex string.
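As a minimal sketch of chaining with |, assuming the three patterns shown in the question, the list can be joined into one alternation and compiled once with qr// before the read loop. The names @patterns, $interesting, and is_interesting are made up for illustration, and whether this actually beats a chain of separate matches depends on the patterns and the Perl version, so it is worth benchmarking.

```perl
use strict;
use warnings;

# Patterns taken from the question; extend the list as it grows.
my @patterns = (
    'Failed in routing out',
    'Agent .+ failed',
    'Record Not Exist in DB',
);

# Wrap each pattern in a non-capturing group, join with |,
# and compile once, outside the loop over input lines.
my $alternation = join '|', map { "(?:$_)" } @patterns;
my $interesting = qr/$alternation/;

# Hypothetical helper: returns 1 if the line matches any pattern, else 0.
sub is_interesting {
    my ($line) = @_;
    return $line =~ $interesting ? 1 : 0;
}

# Made-up sample lines standing in for real log input.
my @sample = (
    "12:00 Agent 42 failed to start\n",
    "12:01 heartbeat ok\n",
    "12:02 Record Not Exist in DB\n",
);
my @hits = grep { is_interesting($_) } @sample;
print @hits;
```

Keeping the patterns in a plain list means adding a new one is a one-line change, which preserves the readability of the original if-chain.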
With the multiline option, $ matches either just before a newline character ( \n ) or at the end of the input string. It does not, however, treat the carriage return/line feed character combination as a single line ending.
The m operator in Perl is used to match a pattern within the given text. The pattern passed to m can be enclosed in almost any character, which then serves as the delimiter of the regular expression, for example m/.../ or m{...}.
You might want to take a look at Regexp::Assemble. It's intended to handle exactly this sort of problem.
Code borrowed from the module's synopsis:
use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
$ra->add( 'ab+c' );
$ra->add( 'ab+-' );
$ra->add( 'a\w\d+' );
$ra->add( 'a\d+' );
print $ra->re; # prints a(?:\w?\d+|b+[-c])
You can even slurp your regex collection out of a separate file.
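A rough sketch of that idea, assuming Regexp::Assemble is installed and using only the add and re methods from the synopsis: the patterns are kept one per line, read in, and assembled into a single regex. To keep the example self-contained, the "file" here is an in-memory string holding the three patterns from the question. With a real file you would open its path instead. The variable names and sample lines are made up for illustration.

```perl
use strict;
use warnings;
use Regexp::Assemble;

# Stand-in for a separate patterns file: one regex per line.
my $pattern_text = <<'END';
Failed in routing out
Agent .+ failed
Record Not Exist in DB
END

my $ra = Regexp::Assemble->new;

# Read the patterns line by line, as you would from a real file.
open my $fh, '<', \$pattern_text or die "cannot open patterns: $!";
while (my $pattern = <$fh>) {
    chomp $pattern;
    next unless length $pattern;    # skip blank lines
    $ra->add($pattern);
}
close $fh;

# Assemble once, then reuse the compiled regex for every log line.
my $re = $ra->re;

# Made-up sample lines standing in for real log input.
my @sample = (
    "12:00 Agent 42 failed to start\n",
    "12:01 heartbeat ok\n",
    "12:02 Failed in routing out\n",
);
my @hits = grep { $_ =~ $re } @sample;
print @hits;
```

Keeping the patterns in their own file separates the list that keeps growing from the matching code, so new patterns can be added without touching the script.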