Lately I am using a lot of regular expressions in java/groovy. For testing I routinely use regex101.com. Obviously I am looking at the regular expressions performance too.
One thing I noticed that using .*
properly can significantly improve the overall performance. Primarily, using .*
in between, or better to say not at the end of the regular expression is performance kill.
For example, in this regular expression the required number of steps is 27:
If I change first .*
to \s*
, it will reduce the steps required significantly to 16:
However, if I change second .*
to \s*
, it does not reduce the steps any further:
I have few questions:
\s
and .*
. I know the difference. I want to know why \s
and .*
costs different based on their position in the complete regex. And then the characteristics of the regex which may cost different based on their position in the overall regex (or based on any other aspect other than position, if there is any). The Match-zero-or-more Operator ( * ) This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern.
The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.
Good regular expressions are often longer than bad regular expressions because they make use of specific characters/character classes and have more structure. This causes good regular expressions to run faster as they predict their input more accurately.
The following is output from the debugger.
The big reason for the difference in performance is that .*
will consume everything until the end of the string (except the newline). The pattern will then continue, forcing the regex to backtrack (as seen in the first image).
The reason that \s
and .*
perform equally well at the end of the pattern is that the greedy pattern vs. consuming whitespace makes no difference if there's nothing else to match (besides WS).
If your test string didn't end in whitespace, there would be a difference in performance, much like you saw in the first pattern - the regex would be forced to backtrack.
EDIT
You can see the performance difference if you end with something besides whitespace:
Bad:
^myname.*mahesh.*hiworld
Better:
^myname.*mahesh\s*hiworld
Even better:
^myname\s*mahesh\s*hiworld
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With