Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java Regular Expression running very slow

I'm trying to use the Daring Fireball Regular Expression for matching URLs in Java, and I've found a URL that causes the evaluation to take forever. I've modified the original regex to work with Java syntax.

private final static String pattern = 
"\\b" + 
"(" +                            // Capture 1: entire matched URL
  "(?:" +
    "[a-z][\\w-]+:" +                // URL protocol and colon
    "(?:" +
      "/{1,3}" +                        // 1-3 slashes
      "|" +                             //   or
      "[a-z0-9%]" +                     // Single letter or digit or '%'
                                        // (Trying not to match e.g. "URI::Escape")
    ")" +
    "|" +                            //   or
    "www\\d{0,3}[.]" +               // "www.", "www1.", "www2." … "www999."
    "|" +                            //   or
    "[a-z0-9.\\-]+[.][a-z]{2,4}/" +  // looks like domain name followed by a slash
  ")" +
  "(?:" +                           // One or more:
    "[^\\s()<>]+" +                      // Run of non-space, non-()<>
    "|" +                               //   or
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
  ")+" +
  "(?:" +                           // End with:
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
    "|" +                                   //   or
    "[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" +        // not a space or one of these punct chars (updated to add a 'dash'
  ")" +
")";

// @see http://daringfireball.net/2010/07/improved_regex_for_matching_urls
private static final Pattern DARING_FIREBALL_PATTERN = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

If I attempt to run the following, it takes forever. I've narrowed it down to the matching of balanced parens (I think). If you change the text within the parens, it works fine, but at about 15 characters, it starts to slow down exponentially.

final Matcher matcher = pattern.matcher("https://goo.gl/a(something_really_long_in_balanced_parens)");
boolean found = matcher.find();

Is there a way to improve this regex so that the lines about don't take forever? I have about 100 different URLs in a JUnit test class, and I need those to continue to work as well.

like image 925
Valdis R Avatar asked Feb 16 '11 01:02

Valdis R


1 Answers

The problem is here:

"(?:" +                           // One or more:
"[^\\s()<>]+" +                      // Run of non-space, non-()<>
"|" +                               //   or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
")+"

What you've got here is nested quantifiers. This plays havoc with any backtracking algorithm - as an example, consider the regex /^(a+)+$/ matching against the string

aaaaaaaaaab

As a first attempt, the inner quantifier will match all of the as. Then the regex fails, so it backs off one. Then the outer quantifier tries to match again, swallowing up the last a, then the regex fails once more. We basically get exponential behaviour as the quantifiers try all sorts of ways of splitting up the run of as, without actually making any progress.

The solution is possessive quantifiers (which we denote by tacking a + onto the end of a quantifier) - we set up the inner quantifiers so that once they have a match, they don't let it go - they'll hold onto that until the match fails or an earlier quantifier backs off and they have to rematch starting somewhere else in the string. If we instead used /^(a++)+$/ as our regex, we would fail immediately on the non-matching string above, rather than going exponential trying to match it.

Try making those inner quantifiers possessive and see if it helps.

like image 91
Anon. Avatar answered Nov 18 '22 22:11

Anon.