I want to capture several text using the following regex:
$text_normal = qr{^(\/F\d+) FF (.*?) SCF SF (.*?) MV (\(.*?)SH$};
A sample of the string is like below:
my $text = '/F12345 FF FF this is SCF SF really MV (important stuff SH';
Can that be rewritten to speed up the matching?
Expose Literal Characters Regex engines match fastest when anchors and literal characters are right there in the main pattern, rather than buried in sub-expressions. Hence the advice to "expose" literal characters whenever you can take them out of an alternation or quantified expression. Let's look at two examples.
Regex has an interpreted mode and a compiled mode. The compiled mode takes longer to start, but is generally faster. Some users want both startup time and performance; other users want to run on a platform where JITted code is not allowed, and also want performance.
The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.
In general, Perl uses a backtrack regex engine. Such an engine is flexible, easy to implement and very fast on a subset of regex. However, for other types of regex, for example when there is the | operator, it may become very slow. In the extreme case, its match speed is exponential in the length of the pattern.
There's no single answer to optimizing a regex. You can watch what a particular regex is doing with the re pragma:
use re 'debugcolor';
Once you see what it traverses the string, you see where it is having problems and adjust your regex from there. You'll learn a little about the regex engine as you do that.
You should also check out Mastering Regular Expressions, which tells you how regular expressions work and why some patterns are slower than others.
Without seeing some sample data it is hard to say.
Generally, it is a good idea to avoid using .*
. Look for any possible sources of unneeded backtracking, and eliminate them.
You might be able to get away with a split
with a slice if your needs are simple.
my @vals = (split / /, $string)[0,2,5,7];
This very much depends on the profile of the data you are scanning.
You identify the piece of your regular expression which filters out the most input and do a separate simpler regular expression for that expression.
For instance, if only 5% of your input date contained the 'MV'
string, you could filter for this first and only apply the full more complex regular expression if the simpler one is true.
So you would have:
if ( $text_normal =~ / MV / ) {
$text_normal = qr{^(\/F\d+) FF (.*?) SCF SF (.*?) MV (\(.*?)SH$};
if .......
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With