
How can I efficiently match many different regex patterns in Perl?

Tags: regex, perl

I have a growing list of regular expressions that I am using to parse through log files searching for "interesting" error and debug statements. I'm currently breaking them into 5 buckets, with most of them falling into 3 large buckets. I have over 140 patterns so far, and the list is continuing to grow.

Most of the regular expressions are simple, but they're also fairly unique, so my opportunities to catch multiple matches with a single pattern are few and far between. Because of the nature of what I'm matching, the patterns tend to be obscure and therefore seldom matched. As a result, I'm doing a TON of work on each input line, only for it to fail to match anything or to match one of the generic patterns at the very end.

Because of the quantity of input (hundreds of megabytes of log files), I'm sometimes waiting a minute or two for the script to finish, hence my desire for a more efficient solution. I'm not interested in sacrificing clarity for speed, though.

I currently have the regular expressions set up like this:

if (($line =~ m{Failed in routing out}) ||
    ($line =~ m{Agent .+ failed}) ||
    ($line =~ m{Record Not Exist in DB}) ||
    ...

Is there a better way of structuring this so it's more efficient, yet still maintainable? Thanks!

Joe Casadonte asked Sep 25 '09 15:09



1 Answer

You might want to take a look at Regexp::Assemble. It's intended to handle exactly this sort of problem.

Code lifted from the module's synopsis:

use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add( 'ab+c' );
$ra->add( 'ab+-' );
$ra->add( 'a\w\d+' );
$ra->add( 'a\d+' );
print $ra->re; # prints a(?:\w?\d+|b+[-c])

You can even slurp your regex collection out of a separate file.
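
As a rough sketch of how this might apply to the log-matching case in the question: the three patterns below are the ones shown there, while the sample log lines and the variable names (@patterns, $interesting, @lines, @hits) are made up for illustration. The point is that the patterns are assembled into one regex once, before the loop, so each line is tested against a single compiled pattern instead of a long chain of separate matches.

```perl
use strict;
use warnings;
use Regexp::Assemble;

# Patterns taken from the question. In a real script the list could be
# kept in a separate file, one pattern per line, to stay maintainable.
my @patterns = (
    'Failed in routing out',
    'Agent .+ failed',
    'Record Not Exist in DB',
);

# Add every pattern, then assemble them into a single regex.
# This happens once, outside the per-line loop.
my $ra = Regexp::Assemble->new;
$ra->add(@patterns);
my $interesting = $ra->re;

# Hypothetical sample input standing in for the real log files.
my @lines = (
    'INFO  startup complete',
    'ERROR Agent smith failed',
    'DEBUG Record Not Exist in DB',
);

# One match per line against the combined regex.
my @hits = grep { $_ =~ $interesting } @lines;
print "$_\n" for @hits;
```

With the sample input above, only the second and third lines match. Whether this is actually faster than the chained version depends on the patterns and the Perl version, so it is worth benchmarking against your real logs.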

daotoad answered Nov 15 '22 00:11