Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular Expression High CPU

Tags:

java

regex

We are encountering issues in our production environment with 100% CPU utilization and thread dump shows our "Black List" implementation stuck on following.

[11/14/13 10:12:42:745 CST] 0000000d ThreadMonitor W   WSVR0605W: Thread "Thread-002" (0000003b) has been active for 604063 milliseconds and may be hung.  There is/are 1 thread(s) in total in the server that may be hung.
    at java.lang.Character.isLetterOrDigit(Character.java:3516)
    at java.util.regex.Pattern$Bound.check(Pattern.java:4820)
    at java.util.regex.Pattern$Bound.match(Pattern.java:4832)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Matcher.match(Matcher.java:1127)
    at java.util.regex.Matcher.matches(Matcher.java:502)

The problem is this does not happen always which mean it is data dependent and we have following list of Regular Expressions in our Black List implementation.

I would appreciate if someone can tell which regular expression out of the following can potentially cause the 100% CPU (I would guess due to back tracking).

# zero/null bytes 
.*\x00.*

# javascript
.*\bjavascript\b.* &&! ^[\w.,;/*+= -]*\b(:?application/x-javascript|text/javascript)\b[\w.,;/*+= -]*$
# vbscript
.*\bvbscript\b.*

# <script or </script>
.*<\s*/?\s*script\b.*|.*\bscript\s*>.*
# <img or </img>
.*<\s*/?\s*img\b.*|.*\bimg\s*>.*
# <div or </div>
.*<\s*/?\s*div\b.*|.*\bdiv\s*>.*
# <html or </html>
.*<\s*/?\s*html\b.*|.*\bhtml\s*>.*
# <body or </body>
.*<\s*/?\s*body\b.*|.*\bbody\s*>.*
# <link or </link>
.*<\s*/?\s*link\b.*|.*\blink\s*>.* &&! ^<\?xml .*$
# <meta or </meta>
.*<\s*/?\s*meta\b.*|.*\bmeta\s*>.*
# <base or </base>
.*<\s*/?\s*base\b.*|.*\bbase\s*>.*
# <span or </span>
.*<\s*/?\s*span\b.*|.*\bspan\s*>.*
# <input or </input>
.*<\s*/?\s*input\b.*|.*\binput\s*>.*
# <style or </style>
.*<\s*/?\s*style\b.*|.*\bstyle\s*>.*
# <table or </table>
.*<\s*/?\s*table\b.*|.*\btable\s*>.*
# <embed or </embed>
.*<\s*/?\s*embed\b.*|.*\bembed\s*>.*
# <frame or </frame>
.*<\s*/?\s*frame\b.*|.*\bframe\s*>.*
# <iframe or </iframe>
.*<\s*/?\s*iframe\b.*|.*\biframe\s*>.*
# <object or </object>
.*<\s*/?\s*object\b.*|.*\bobject\s*>.*

# < onload= >
.*<.+\bonload\s*=.+>.*
# < onerror= >
.*<.+\bonerror\s*=.+>.*
# < onmouseover= >
.*<.+\bonmouseover\s*=.+>.*
# < src= >
.*<.+\bsrc\s*=.+>.*
# < href= >
.*<.+\bhref\s*=.+>.*
# < style= >
.*<.+\bstyle\s*=.+>.*
# < content= >
.*<.+\bcontent\s*=.+>.*

# document.
.*\bdocument\s*\..+
# element.
.*\belement\s*\..+

# url(
.*\burl\s*\(.+
# eval(
.*\beval\s*\(.+
# alert(
.*\balert\s*\(.+

# /* */
.*/\*.*\*/.*


# HTTP response splitting
.*\bHTTP/\d+\.\d+.+


# Path traversal
#.*\.\.[/\\].*
.*\.\.[/\\]\.\.[/\\].*


# SQL injection (probably not very useful)

# from HttpServletBase.java
.*select\s+\S*\s*from\s+\S+(?:\s+where\s+.+)?.*
.*insert\s+\S*\s*into\s+\S+(?:\s+values\s+.+)?.*
.*update\s+\S*\s*set\s+\S+(?:\s+where\s+.+)?.*
.*delete\s+\S*\s*from\s+\S+(?:\s+where\s+.+)?.*
like image 753
user3017924 Avatar asked Apr 18 '26 04:04

user3017924


1 Answers

As I say in the comments, proper testing of the data is the only way to really know which regex is problematic. But I think there is something I can pinpoint and you can look into. Basically, be careful about the dot operator.

Look at:

.*<.+\bonload\s*=.+>.*

I guess you want to find any HTML tag that contains "onload". The thing is, if you have data with a lot of tags, and none contain "onload", it will do like this:

  1. Find the first <
  2. Go through the rest of the string to find "\bonload"
  3. Since none can be found, backtrack and try with the following < in the string.

This will therefore be repeated for every tag in the string, so step 2 can be expensive if your string is long. You can optimize by preventing it to go past the end of the tag by replacing .+ with [^>]+ (that is, anything but a >).

So the following regex should perform better:

.*<[^>]+\bonload\s*=.+>.*

Also, following the general principle that "the fastest code is the code that doesn't run", you can look if the string contains the string "onload" with a simple string search before using a regular expression.

like image 194
Cyrille Ka Avatar answered Apr 20 '26 17:04

Cyrille Ka



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!