Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does this regex take a long time to execute?

Tags:

java

regex

I found out, that for example this line has a very very long execution time:

System.out.println(
        ".. .. .. .. .. .. .. .. ..  .. .. .. .. .. .. .. .. .. .. .. .... .. .."
        .matches("(?i)(?:.* )?\\W?([a-z0-9-_\\.]+((?: *)\\.(?: *))+(?:DE))(?:[0-9]{1,5})?")
);

If I reduce the amount of dots at the start of the String the execution time gets lower (seems like it's exponential). Here is the suspended thread's stack trace:

[Repeating text]...
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4279
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Loop.match(Matcher, int, CharSequence) line: 4785
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4279
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Single(Pattern$BmpCharProperty).match(Matcher, int, CharSequence) line: 3798
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Loop.match(Matcher, int, CharSequence) line: 4785
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Single(Pattern$BmpCharProperty).match(Matcher, int, CharSequence) line: 3798
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4279
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Loop.matchInit(Matcher, int, CharSequence) line: 4801
Pattern$Prolog.match(Matcher, int, CharSequence) line: 4741
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Ques.match(Matcher, int, CharSequence) line: 4182
Pattern$BranchConn.match(Matcher, int, CharSequence) line: 4568
Pattern$GroupTail.match(Matcher, int, CharSequence) line: 4717
Pattern$Single(Pattern$BmpCharProperty).match(Matcher, int, CharSequence) line: 3798
Pattern$Curly.match0(Matcher, int, int, CharSequence) line: 4272
Pattern$Curly.match(Matcher, int, CharSequence) line: 4234
Pattern$GroupHead.match(Matcher, int, CharSequence) line: 4658
Pattern$Branch.match(Matcher, int, CharSequence) line: 4604
Matcher.match(int, int) line: 1270
Matcher.matches() line: 604
Pattern.matches(String, CharSequence) line: 1135
String.matches(String) line: 2121
Main.main(String[]) line: 11

Why does this happen?

like image 266
EaZyCode Avatar asked May 24 '18 13:05

EaZyCode


People also ask

Is using regex slow?

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

Is there anything faster than regex?

String operations will always be faster than regular expression operations. Unless, of course, you write the string operations in an inefficient way. Regular expressions have to be parsed, and code generated to perform the operation using string operations.

How fast are regex expressions?

The bad regular expression took on average 10,100 milliseconds to process all 1,000,000 lines, while the good regular expression took just 240 milliseconds.


1 Answers

When pattern x is made optional - using ? or * quantifiers (or {0,}) - engine has two paths to approach according to the nature of quantifier being used:

  • Consumes then backtracks for other patterns (case of greediness i.e. .*, .?)
  • First doesn't consume and looks immediately for other patterns (case of laziness .*?)

Someone probably is not aware about regular expressions or doesn't care about performance and throws .* wherever he needs a match somewhere in string and engines are so fast in taking steps back and forth that nothing seems weird or slow unless a pattern can not be found.

Time complexity starts at O(n) and continues with O(n^2b) where b is level of nesting quantifiers. So on failure number of steps an engine takes is HUGE.

To avoid such situations someone needs to consider some guiding principles:

  • Specifying boundaries. If pattern should stop somewhere before digits do not do .*. Instead do \D*.

  • Use conditions. You can check if pattern / letter x exists before running a whole match using a lookahead ^(?=[^x]*x). This leads to an early failure.

  • Use possessive quantifiers or atomic groups (if available). These two avoid backtracks. Sometimes you do not need backtracks.

  • Do not do (.*)+ or similar patterns. Instead reconsider your requirements or at least use atomic groups (?>.*)+.

Your own Regular Expression isn't an exception. It suffers from much greediness and optional matches and needs a time to be restudied.

like image 129
revo Avatar answered Nov 15 '22 03:11

revo