Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why regular expression .* is slower at one place and faster at other

Tags:

regex

Lately I am using a lot of regular expressions in java/groovy. For testing I routinely use regex101.com. Obviously I am looking at the regular expressions performance too.

One thing I noticed that using .* properly can significantly improve the overall performance. Primarily, using .* in between, or better to say not at the end of the regular expression is performance kill.

For example, in this regular expression the required number of steps is 27:

enter image description here

If I change first .* to \s*, it will reduce the steps required significantly to 16:

enter image description here

However, if I change second .* to \s*, it does not reduce the steps any further:

enter image description here

I have few questions:

  1. Why the above? I dont want to compare \s and .*. I know the difference. I want to know why \s and .* costs different based on their position in the complete regex. And then the characteristics of the regex which may cost different based on their position in the overall regex (or based on any other aspect other than position, if there is any).
  2. Does the steps counter given in this site really gives any indication about regex performance?
  3. what other simple or similar (position related) regex performance observations you have?
like image 548
Mahesha999 Avatar asked Nov 06 '15 13:11

Mahesha999


People also ask

What does * do in regular expression?

The Match-zero-or-more Operator ( * ) This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern.

Why are regular expressions slow?

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

Why are regular expressions fast?

Good regular expressions are often longer than bad regular expressions because they make use of specific characters/character classes and have more structure. This causes good regular expressions to run faster as they predict their input more accurately.


1 Answers

The following is output from the debugger.

pattern 1

pattern 2

pattern 3

The big reason for the difference in performance is that .* will consume everything until the end of the string (except the newline). The pattern will then continue, forcing the regex to backtrack (as seen in the first image).

The reason that \s and .* perform equally well at the end of the pattern is that the greedy pattern vs. consuming whitespace makes no difference if there's nothing else to match (besides WS).

If your test string didn't end in whitespace, there would be a difference in performance, much like you saw in the first pattern - the regex would be forced to backtrack.

EDIT

You can see the performance difference if you end with something besides whitespace:

Bad:

^myname.*mahesh.*hiworld

bad

Better:

^myname.*mahesh\s*hiworld

little better

Even better:

^myname\s*mahesh\s*hiworld

Much better

like image 159
erip Avatar answered Oct 01 '22 11:10

erip