Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I speed up my Perl regex matching?

Tags:

regex

perl

I want to capture several text using the following regex:

$text_normal = qr{^(\/F\d+) FF (.*?) SCF SF (.*?) MV (\(.*?)SH$};

A sample of the string is like below:

my $text = '/F12345 FF FF this is SCF SF really MV (important stuff SH';

Can that be rewritten to speed up the matching?

like image 569
est Avatar asked Oct 20 '09 03:10

est


People also ask

How do I make regular expressions faster?

Expose Literal Characters Regex engines match fastest when anchors and literal characters are right there in the main pattern, rather than buried in sub-expressions. Hence the advice to "expose" literal characters whenever you can take them out of an alternation or quantified expression. Let's look at two examples.

Does compiling regex make it faster?

Regex has an interpreted mode and a compiled mode. The compiled mode takes longer to start, but is generally faster. Some users want both startup time and performance; other users want to run on a platform where JITted code is not allowed, and also want performance.

Is regex matching slow?

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

Why is Perl good for regex?

In general, Perl uses a backtrack regex engine. Such an engine is flexible, easy to implement and very fast on a subset of regex. However, for other types of regex, for example when there is the | operator, it may become very slow. In the extreme case, its match speed is exponential in the length of the pattern.


3 Answers

There's no single answer to optimizing a regex. You can watch what a particular regex is doing with the re pragma:

 use re 'debugcolor';

Once you see what it traverses the string, you see where it is having problems and adjust your regex from there. You'll learn a little about the regex engine as you do that.

You should also check out Mastering Regular Expressions, which tells you how regular expressions work and why some patterns are slower than others.

like image 100
brian d foy Avatar answered Sep 21 '22 16:09

brian d foy


Without seeing some sample data it is hard to say.

Generally, it is a good idea to avoid using .*. Look for any possible sources of unneeded backtracking, and eliminate them.

You might be able to get away with a split with a slice if your needs are simple.

 my @vals = (split / /, $string)[0,2,5,7];
like image 30
daotoad Avatar answered Sep 22 '22 16:09

daotoad


This very much depends on the profile of the data you are scanning.

You identify the piece of your regular expression which filters out the most input and do a separate simpler regular expression for that expression.

For instance, if only 5% of your input date contained the 'MV' string, you could filter for this first and only apply the full more complex regular expression if the simpler one is true.

So you would have:

if ( $text_normal =~ / MV / ) {
  $text_normal = qr{^(\/F\d+) FF (.*?) SCF SF (.*?) MV (\(.*?)SH$};
  if .......
  }
}
like image 38
James Anderson Avatar answered Sep 18 '22 16:09

James Anderson