Java regex dies on stack overflow: need a better version

Question

I'm working on a JMD (Java MarkDown) (a Java port of MarkDownSharp) but I'm having an issue with one regex in particular. For the file Markdown_Documentation_Syntax.text this regular expression dies:

private static final String BLOCK_TAGS_1 = "p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del";
private static final String BLOCKS_NESTED_PATTERN = String.format("" +
        "(" +                      // save in $1
        "^" +                      // start of line (with MULTILINE)
        "<(%s)" +                  // start tag = $2
        "\b" +                    // word break
        "(.*\n)*?" +              // any number of lines, minimally matching
        "</\2>" +                 // the matching end tag
        "[ \t]*" +                // trailing spaces/tags
        "(?=\n+|\Z)" +           // followed by a newline or end of
        ")", BLOCK_TAGS_1);

which translates to:

(^<(p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del)\b(.*
)*?</\2>[ 	]*(?=
+|\Z))

This pattern is looking for accepted block tags that are anchored to the start of a line, followed by any number of lines and then are terminated by a matching tag followed by a newline or a string terminator. This generates:

java.lang.StackOverflowError
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
        ...

This can be dealt with by increasing the stack space for Java (defaults to 128k/400k for oss/ss IIRC) but the above expression is slow anyway.

So I'm looking for a regex guru who can do better (or at least explain the performance problem with this pattern). The C# version is a little slow but works fine. PHP seems to have no issues with this either.

Edit: This is on JDK6u17 running on Windows 7 64 Ultimate.

Rob Van Dam · Accepted Answer

This part:

(.*
)*?

will involve A LOT of unnecessary backtracking because of the nested * and since there are chars that have to match afterwards.

I just ran a quick benchmark in perl on some arbitrary strings and got a 13-15% improvement just by switching that piece to

(?>.*
)*?

which does non-capturing, independent subgrouping. That gives you two benefits, it no longer wastes time capturing the matching string, and more importantly, it no longer backtracks on the innermost .* which is a waste of time anyway. There's no way that only a portion of that .* will ever result in a valid match so explicitly making it all or nothing should help.

Don't know if that's a sufficient improvement in this case, however.

Java regex dies on stack overflow: need a better version

Tags:

java

regex

cletus

1 Answers

Rob Van Dam

Recent Activity

Donate For Us

Java regex dies on stack overflow: need a better version

Tags:

java

regex

cletus

1 Answers

Rob Van Dam

Related questions

Recent Activity

Donate For Us