Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java regex dies on stack overflow: need a better version

Tags:

java

regex

I'm working on a JMD (Java MarkDown) (a Java port of MarkDownSharp) but I'm having an issue with one regex in particular. For the file Markdown_Documentation_Syntax.text this regular expression dies:

private static final String BLOCK_TAGS_1 = "p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del";
private static final String BLOCKS_NESTED_PATTERN = String.format("" +
        "(" +                      // save in $1
        "^" +                      // start of line (with MULTILINE)
        "<(%s)" +                  // start tag = $2
        "\\b" +                    // word break
        "(.*\\n)*?" +              // any number of lines, minimally matching
        "</\\2>" +                 // the matching end tag
        "[ \\t]*" +                // trailing spaces/tags
        "(?=\\n+|\\Z)" +           // followed by a newline or end of
        ")", BLOCK_TAGS_1);

which translates to:

(^<(p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del)\b(.*\n)*?</\2>[ \t]*(?=\n+|\Z))

This pattern is looking for accepted block tags that are anchored to the start of a line, followed by any number of lines and then are terminated by a matching tag followed by a newline or a string terminator. This generates:

java.lang.StackOverflowError
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
        ...

This can be dealt with by increasing the stack space for Java (defaults to 128k/400k for oss/ss IIRC) but the above expression is slow anyway.

So I'm looking for a regex guru who can do better (or at least explain the performance problem with this pattern). The C# version is a little slow but works fine. PHP seems to have no issues with this either.

Edit: This is on JDK6u17 running on Windows 7 64 Ultimate.

like image 506
cletus Avatar asked Jan 04 '10 02:01

cletus


1 Answers

This part:

(.*\n)*?

will involve A LOT of unnecessary backtracking because of the nested * and since there are chars that have to match afterwards.

I just ran a quick benchmark in perl on some arbitrary strings and got a 13-15% improvement just by switching that piece to

(?>.*\n)*?

which does non-capturing, independent subgrouping. That gives you two benefits, it no longer wastes time capturing the matching string, and more importantly, it no longer backtracks on the innermost .* which is a waste of time anyway. There's no way that only a portion of that .* will ever result in a valid match so explicitly making it all or nothing should help.

Don't know if that's a sufficient improvement in this case, however.

like image 156
Rob Van Dam Avatar answered Nov 06 '22 19:11

Rob Van Dam