I'm working on a JMD (Java MarkDown) (a Java port of MarkDownSharp) but I'm having an issue with one regex in particular. For the file Markdown_Documentation_Syntax.text this regular expression dies:
private static final String BLOCK_TAGS_1 = "p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del";
private static final String BLOCKS_NESTED_PATTERN = String.format("" +
"(" + // save in $1
"^" + // start of line (with MULTILINE)
"<(%s)" + // start tag = $2
"\\b" + // word break
"(.*\\n)*?" + // any number of lines, minimally matching
"</\\2>" + // the matching end tag
"[ \\t]*" + // trailing spaces/tags
"(?=\\n+|\\Z)" + // followed by a newline or end of
")", BLOCK_TAGS_1);
which translates to:
(^<(p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del)\b(.*\n)*?</\2>[ \t]*(?=\n+|\Z))
This pattern is looking for accepted block tags that are anchored to the start of a line, followed by any number of lines and then are terminated by a matching tag followed by a newline or a string terminator. This generates:
java.lang.StackOverflowError
at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
...
This can be dealt with by increasing the stack space for Java (defaults to 128k/400k for oss/ss IIRC) but the above expression is slow anyway.
So I'm looking for a regex guru who can do better (or at least explain the performance problem with this pattern). The C# version is a little slow but works fine. PHP seems to have no issues with this either.
Edit: This is on JDK6u17 running on Windows 7 64 Ultimate.
This part:
(.*\n)*?
will involve A LOT of unnecessary backtracking because of the nested *
and since there are chars that have to match afterwards.
I just ran a quick benchmark in perl on some arbitrary strings and got a 13-15% improvement just by switching that piece to
(?>.*\n)*?
which does non-capturing, independent subgrouping. That gives you two benefits, it no longer wastes time capturing the matching string, and more importantly, it no longer backtracks on the innermost .*
which is a waste of time anyway. There's no way that only a portion of that .* will ever result in a valid match so explicitly making it all or nothing should help.
Don't know if that's a sufficient improvement in this case, however.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With