Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java pattern matching going to infinite loop

Tags:

java

regex

A friend gave me this piece of code and said there is a bug. And yes, this code runs for ever.

The answer I got is:

It runs for >10^15 years before printing anything.

public class Match {
     public static void main(String[] args) {
         Pattern p = Pattern.compile("(aa|aab?)+");
         int count = 0;
         for(String s = ""; s.length() < 200; s += "a")
             if (p.matcher(s).matches())
                 count++;
         System.out.println(count);
     }
}

I didn't really understand why am I seeing this behavior, I am new to java, do you have any suggestions?

like image 680
John Avatar asked Oct 26 '11 04:10

John


2 Answers

The pattern you are using is known as an evil regex according to OWASP (they know what they're talking about most of the time):

https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS

It basically matches aa OR aa or aab (since the b is optional by addition of ?)

A Regex like this is vulnerable to a ReDoS or Regex Denial of Service Attack.

So yes, sort out what you want to match. I suggest in the above example you should simply match aa, no need for groups, repitition or alternation:

Pattern p = Pattern.compile("aa");

Also as someone pointed out, who now deleted his post, you should not use += to append to strings. You should use a StringBuffer instead:

public class Match {
  public static void main(String[] args) {
    Pattern p = Pattern.compile("aa");
    StringBuffer buffy = new StringBuffer(200);
    int count = 0;
    for(int i = 0; i < 200; i++){
      buffy.append("a")
      if (p.matcher(buffy.toString()).matches()){
        count++;
      }
    }
    System.out.println(count);
  }
}
like image 72
Benjamin Udink ten Cate Avatar answered Sep 30 '22 19:09

Benjamin Udink ten Cate


The regular expression (aa|aab?)+ is one that takes an especially long time for the regular expression engine to handle. These are colorfully called evil regexes. It is similar to the (a|aa)+ example at the link. This particular one is very slow on a string composed entirely of as.

What this code does is check the evil regex against increasingly long strings of as, up to length 200, so it certainly ought to take a long time, and it doesn't print until the loop ends. I'd be interested to know where the 10^15 years figure came from.

Edit

OK, the 10^15 (and in fact the entire piece of code in the question) comes from this talk, slide 37. Thanks to zengr for that link. The most relevant piece of information to the question is that the check for this regex takes time that is exponential in the length of the string. Specifically it's O(2^(n/2)), so it takes 2^99 (or so) times longer to check the last string than the first one.

like image 24
andronikus Avatar answered Sep 30 '22 19:09

andronikus