I recently had a SonarQube rule (https://rules.sonarsource.com/java/RSPEC-4784) bring to my attention some performance issues which could be used as a denial of service against a Java regular expression implementation. Indeed, the following Java test shows how slow the wrong regular expression can be: <pre class="prettyprint"><code> import org.junit.Test; public class RegexTest { @Test public void fastRegex1() { "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)b"); } @Test public void fastRegex2() { "aaaaaaaaaaaaaaaaaaaaaaaaaaaab".matches("(a+)+b"); } @Test public void slowRegex() { "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b"); } } </code></pre> As you can see, the first two tests are fast, the third one is incredibly slow (in Java 8) <img src="https://i.stack.imgur.com/5bUNZ.png" alt="Enter image description here"> The same data and regex in Perl or Python, however, is not at all slow, which leads me to wonder why it is that this regular expression is so slow to evaluate in Java. <pre class="prettyprint"><code>$ time perl -e '"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs" =~ /(a+)+b/ && print "$1\n"' aaaaaaaaaaaaaaaaaaaaaaaaaaaa real 0m0.004s user 0m0.000s sys 0m0.004s $ time python3 -c 'import re; m=re.search("(a+)+b","aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"); print(m.group(0))' aaaaaaaaaaaaaaaaaaaaaaaaaaaab real 0m0.018s user 0m0.015s sys 0m0.004s </code></pre> What is it about the extra matching modifier <code>+</code> or trailing character <code>s</code> in the data which makes this regular expression so slow, and why is it only specific to Java?

Caveat: I don't really know much about regex internals, and this is really conjecture. And I can't answer why Java suffers from this, but not the others (also, it is substantially faster than your 12 seconds in jshell 11 when I run it, so it perhaps only affects certain versions). <pre class="prettyprint"><code>"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b") </code></pre> There are lots of ways that lots of <code>a</code>s could match: <pre class="prettyprint"><code>(a)(a)(a)(a) (aa)(a)(a) (a)(aa)(a) (aa)(aa) (a)(aaa) etc. </code></pre> For the input string <code>"aaaaaaaaaaaaaaaaaaaaaaaaaaaab"</code>, it will greedily match all of those <code>a</code>s in a single pass, match the <code>b</code>, job done. For <code>"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"</code>, when it gets to the end and finds that the string doesn't match (because of the <code>s</code>), it's not correctly recognizing that the <code>s</code> means it can never match. So, having gone through and likely matched as <pre class="prettyprint"><code>(aaaaaaaaaaaaaaaaaaaaaaaaaaaa)bs </code></pre> it thinks "Oh, maybe it failed because of the way I grouped the <code>a</code>s - and goes back and tries all the other combinations of the <code>a</code>s. <pre class="prettyprint"><code>(aaaaaaaaaaaaaaaaaaaaaaaaaaa)(a)bs // Nope, still no match (aaaaaaaaaaaaaaaaaaaaaaaaaa)(aa)bs // ... (aaaaaaaaaaaaaaaaaaaaaaaaa)(aaa)bs // ... ... (a)(aaaaaaaaaaaaaaaaaaaaaaaaaaa)bs // ... (aaaaaaaaaaaaaaaaaaaaaaaaaa(a)(a)bs // ... (aaaaaaaaaaaaaaaaaaaaaaaaa(aa)(a)bs // ... (aaaaaaaaaaaaaaaaaaaaaaaa(aaa)(a)bs // ... ... </code></pre> There are lots of these (I think there are something like 2^27 - that's 134,217,728 - combinations for 28 <code>a</code>s, because each <code>a</code> can either be part of the previous group, or start its own group), so it takes a long time.

I don't know Perl too well but the Python version is not equivalent to the Java one. You are using <code>search()</code> but the Java version is using <code>matches()</code>. The equivalent method in Python would be <code>fullmatch()</code> When I run your examples in Python (3.8.2) with <code>search()</code> I get quick results as you do. When I run it with <code>fullmatch()</code> I get poor (multi-second) execution time. Could it be that your Perl example is also not doing a full match? BTW: if you want to try the Java version of search you would use: <pre class="prettyprint"><code>Pattern.compile("(a+)+b").matcher("aaaaaaaaaaaaaaaaaaaaaaaaaaaabs").find(); </code></pre> There might be some slight difference in the semantics but it should be close enough for this purpose.

The site https://swtch.com/~rsc/regexp/regexp1.html has some detailed information on regular expression implementation techniques and the theory behind them. I know link only answers are bad, but this is worth reading, showing an example regular expression that completes in 30 micro seconds with the better implementation, and 60 seconds (2 million times slower) with the better known and more obvious way. It says <blockquote> "Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools." </blockquote> Other answers saying that the extra <code>+</code> causes too much backtracking are correct, but only if you ignore the good theory.

Why is this regular expression so slow in Java? [duplicate]

Tags:

java

performance

python

regex

perl

I recently had a SonarQube rule (https://rules.sonarsource.com/java/RSPEC-4784) bring to my attention some performance issues which could be used as a denial of service against a Java regular expression implementation.

Indeed, the following Java test shows how slow the wrong regular expression can be:

    import org.junit.Test;

    public class RegexTest {

    @Test
    public void fastRegex1() {
        "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)b");
    }

    @Test
    public void fastRegex2() {
        "aaaaaaaaaaaaaaaaaaaaaaaaaaaab".matches("(a+)+b");
    }

    @Test
    public void slowRegex() {
        "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b");
    }
}

As you can see, the first two tests are fast, the third one is incredibly slow (in Java 8)

Enter image description here

The same data and regex in Perl or Python, however, is not at all slow, which leads me to wonder why it is that this regular expression is so slow to evaluate in Java.

$ time perl -e '"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs" =~ /(a+)+b/ && print "$1\n"'
aaaaaaaaaaaaaaaaaaaaaaaaaaaa

real    0m0.004s
user    0m0.000s
sys     0m0.004s

$ time python3 -c 'import re; m=re.search("(a+)+b","aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"); print(m.group(0))'
aaaaaaaaaaaaaaaaaaaaaaaaaaaab

real    0m0.018s
user    0m0.015s
sys     0m0.004s

What is it about the extra matching modifier + or trailing character s in the data which makes this regular expression so slow, and why is it only specific to Java?

829

asked Aug 07 '20 08:08

AntonPiatek

4 Answers

Caveat: I don't really know much about regex internals, and this is really conjecture. And I can't answer why Java suffers from this, but not the others (also, it is substantially faster than your 12 seconds in jshell 11 when I run it, so it perhaps only affects certain versions).

"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b")

There are lots of ways that lots of as could match:

(a)(a)(a)(a)
(aa)(a)(a)
(a)(aa)(a)
(aa)(aa)
(a)(aaa)
etc.

For the input string "aaaaaaaaaaaaaaaaaaaaaaaaaaaab", it will greedily match all of those as in a single pass, match the b, job done.

For "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs", when it gets to the end and finds that the string doesn't match (because of the s), it's not correctly recognizing that the s means it can never match. So, having gone through and likely matched as

(aaaaaaaaaaaaaaaaaaaaaaaaaaaa)bs

it thinks "Oh, maybe it failed because of the way I grouped the as - and goes back and tries all the other combinations of the as.

(aaaaaaaaaaaaaaaaaaaaaaaaaaa)(a)bs  // Nope, still no match
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(aa)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa)(aaa)bs  // ...
...
(a)(aaaaaaaaaaaaaaaaaaaaaaaaaaa)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaaaa(a)(a)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa(aa)(a)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaa(aaa)(a)bs  // ...
...

There are lots of these (I think there are something like 2^27 - that's 134,217,728 - combinations for 28 as, because each a can either be part of the previous group, or start its own group), so it takes a long time.

answered Oct 17 '22 12:10

Andy Turner

I don't know Perl too well but the Python version is not equivalent to the Java one. You are using search() but the Java version is using matches(). The equivalent method in Python would be fullmatch()

When I run your examples in Python (3.8.2) with search() I get quick results as you do. When I run it with fullmatch() I get poor (multi-second) execution time. Could it be that your Perl example is also not doing a full match?

BTW: if you want to try the Java version of search you would use:

Pattern.compile("(a+)+b").matcher("aaaaaaaaaaaaaaaaaaaaaaaaaaaabs").find();

There might be some slight difference in the semantics but it should be close enough for this purpose.

answered Oct 17 '22 14:10

JimmyJames

The extra + causes a lot of backtracking (in a naive regexp implementation) when the string cannot be matched. If the string can be matched, the answer is known in the first try. This explains why the case 2 is fast and only case 3 is slow.

answered Oct 17 '22 13:10

Henry

The site https://swtch.com/~rsc/regexp/regexp1.html has some detailed information on regular expression implementation techniques and the theory behind them. I know link only answers are bad, but this is worth reading, showing an example regular expression that completes in 30 micro seconds with the better implementation, and 60 seconds (2 million times slower) with the better known and more obvious way.

It says

"Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools."

Other answers saying that the extra + causes too much backtracking are correct, but only if you ignore the good theory.