I recently became aware of Regular expression Denial of Service attacks, and decided to root out so-called 'evil' regex patterns wherever I could find them in my codebase - or at least those that are used on user input. The examples given at the OWASP link above and wikipedia are helpful, but they don't do a great job of explaining the problem in simple terms. A description of evil regexes, from wikipedia: <ul> <li>the regular expression applies repetition ("+", "*") to a complex subexpression;</li> <li>for the repeated subexpression, there exists a match which is also a suffix of another valid match.</li> </ul> With examples, again from wikipedia: <ul> <li><code>(a+)+</code></li> <li><code>([a-zA-Z]+)*</code></li> <li><code>(a|aa)+</code></li> <li><code>(a|a?)+</code></li> <li> <code>(.*a){x}</code> for x > 10</li> </ul> Is this a problem that just doesn't have a simpler explanation? I'm looking for something that would make it easier to avoid this problem while writing regexes, or to find them within an existing codebase.

<h3>Why Are Evil Regexes A Problem?</h3> Because computers do exactly what you tell them to do, even if it's not what you meant or is totally unreasonable. If you ask a regex engine to prove that, for some given input, there either is or is not a match for a given pattern, then the engine will attempt to do that no matter how many different combinations must be tested. Here is a simple pattern inspired by the first example in the OP's post: <pre class="prettyprint"><code>^((ab)*)+$ </code></pre> Given the input: <blockquote> abababababababababababab </blockquote> The regex engine tries something like <code>(abababababababababababab)</code> and a match is found on the first try. But then we throw the monkey wrench in: <blockquote> abababababababababababab a </blockquote> The engine will first try <code>(abababababababababababab)</code> but that fails because of that extra <code>a</code>. This causes catastrophic backtracking, because our pattern <code>(ab)*</code>, in a show of good faith, will release one of its captures (it will "backtrack") and let the outer pattern try again. For our regex engine, that looks something like this: <blockquote> <code>(abababababababababababab)</code> - Nope <code>(ababababababababababab)(ab)</code> - Nope <code>(abababababababababab)(abab)</code> - Nope <code>(abababababababababab)(ab)(ab)</code> - Nope <code>(ababababababababab)(ababab)</code> - Nope <code>(ababababababababab)(abab)(ab)</code> - Nope <code>(ababababababababab)(ab)(abab)</code> - Nope <code>(ababababababababab)(ab)(ab)(ab)</code> - Nope <code>(abababababababab)(abababab)</code> - Nope <code>(abababababababab)(ababab)(ab)</code> - Nope <code>(abababababababab)(abab)(abab)</code> - Nope <code>(abababababababab)(abab)(ab)(ab)</code> - Nope <code>(abababababababab)(ab)(ababab)</code> - Nope <code>(abababababababab)(ab)(abab)(ab)</code> - Nope <code>(abababababababab)(ab)(ab)(abab)</code> - Nope <code>(abababababababab)(ab)(ab)(ab)(ab)</code> - Nope <code>(ababababababab)(ababababab)</code> - Nope <code>(ababababababab)(abababab)(ab)</code> - Nope <code>(ababababababab)(ababab)(abab)</code> - Nope <code>(ababababababab)(ababab)(ab)(ab)</code> - Nope <code>(ababababababab)(abab)(abab)(ab)</code> - Nope <code>(ababababababab)(abab)(ab)(abab)</code> - Nope <code>(ababababababab)(abab)(ab)(ab)(ab)</code> - Nope <code>(ababababababab)(ab)(abababab)</code> - Nope <code>(ababababababab)(ab)(ababab)(ab)</code> - Nope <code>(ababababababab)(ab)(abab)(abab)</code> - Nope <code>(ababababababab)(ab)(abab)(ab)(ab)</code> - Nope <code>(ababababababab)(ab)(ab)(ababab)</code> - Nope <code>(ababababababab)(ab)(ab)(abab)(ab)</code> - Nope <code>(ababababababab)(ab)(ab)(ab)(abab)</code> - Nope <code>(ababababababab)(ab)(ab)(ab)(ab)(ab)</code> - Nope ... <code>(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abababab)</code> - Nope <code>(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab)(ab)</code> - Nope <code>(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(abab)</code> - Nope <code>(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab)(ab)</code> - Nope <code>(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab)</code> - Nope <code>(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab)</code> - Nope <code>(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)</code> - Nope <code>(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)</code> - Nope </blockquote> The number of possible combinations scales exponentially with the length of the input and, before you know it, the regex engine is eating up all your system resources trying to solve this thing until, having exhausted every possible combination of terms, it finally gives up and reports "There is no match." Meanwhile your server has turned into a burning pile of molten metal. <h3>How to Spot Evil Regexes</h3> It's actually very tricky. Catastrophic backtracking in modern regex engines is similar in nature to the halting problem which Alan Turing proved was impossible to solve. I have written problematic regexes myself, even though I know what they are and generally how to avoid them. Wrapping everything you can in an atomic group can help to prevent the backtracking issue. It basically tells the regex engine not to revisit a given expression - "lock whatever you matched on the first try". Note, however, that atomic expressions don't prevent backtracking within the expression, so <code>^(?>((ab)*)+)$</code> is still dangerous, but <code>^(?>(ab)*)+$</code> is safe (it'll match <code>(abababababababababababab)</code> and then refuse to give up any of it's matched characters, thus preventing catastrophic backtracking). Unfortunately, once it's written, it's actually very hard to immediately or quickly find a problem regex. In the end, recognizing a bad regex is like recognizing any other bad code - it takes a lot of time and experience and/or a single catastrophic event. <hr> Interestingly, since this answer was first written, a team at the University of Texas at Austin published a paper describing the development of a tool capable of performing static analysis of regular expressions with the express purpose of finding these "evil" patterns. The tool was developed to analyse Java programs, but I suspect that in the coming years we'll see more tools developed around analysing and detecting problematic patterns in JavaScript and other languages, especially as the rate of ReDoS attacks continues to climb. <blockquote> <a href="https://www.cs.utexas.edu/%7Eisil/redos.pdf" rel="noreferrer">Static Detection of DoS Vulnerabilities in Programs that use Regular Expressions</a> Valentin Wüstholz, Oswaldo Olivo, Marijn J. H. Heule, and Isil Dillig The University of Texas at Austin </blockquote>

What you call an "evil" regex is a regex that exhibits catastrophic backtracking. The linked page (which I wrote) explains the concept in detail. Basically, catastrophic backtracking happens when a regex fails to match and different permutations of the same regex can find a partial match. The regex engine then tries all those permutations. If you want to go over your code and inspect your regexes these are the 3 key issues to look at: <ol> <li>Alternatives must be mutually exclusive. If multiple alternatives can match the same text then the engine will try both if the remainder of the regex fails. If the alternatives are in a group that is repeated, you have catastrophic backtracking. A classic example is <code>(.|\s)*</code> to match any amount of any text when the regex flavor does not have a "dot matches line breaks" mode. If this is part of a longer regex then a subject string with a sufficiently long run of spaces (matched by both <code>.</code> and <code>\s</code>) will break the regex. The fix is to use <code>(.|\n)*</code> to make the alternatives mutually exclusive or even better to be more specific about which characters are really allowed, such as <code>[\r\n\t\x20-\x7E]</code> for ASCII printables, tabs, and line breaks.</li> <li>Quantified tokens that are in sequence must either be mutually exclusive with each other or be mutually exclusive what comes between them. Otherwise both can match the same text and all combinations of the two quantifiers will be tried when the remainder of the regex fails to match. A classic example is <code>a.*?b.*?c</code> to match 3 things with "anything" between them. When <code>c</code> can't be matched the first <code>.*?</code> will expand character by character until the end of the line or file. For each expansion the second <code>.*?</code> will expand character by character to match the remainder of the line or file. The fix is to realize that you can't have "anything" between them. The first run needs to stop at <code>b</code> and the second run needs to stop at <code>c</code>. With single characters <code>a[^b]*+b[^c]*+c</code> is an easy solution. Since we now stop at the delimiter, we can use possessive quantifiers to further increase performance.</li> <li>A group that contains a token with a quantifier must not have a quantifier of its own unless the quantified token inside the group can only be matched with something else that is mutually exclusive with it. That ensures that there is no way that fewer iterations of the outer quantifier with more iterations of the inner quantifier can match the same text as more iterations of the outer quantifier with fewer iterations of the inner quantifier. This is the problem illustrated in JDB's answer.</li> </ol> While I was writing my answer I decided that this merited a full article on my website. This is now online too.

How can I recognize an evil regex?

Tags:

regex

I recently became aware of Regular expression Denial of Service attacks, and decided to root out so-called 'evil' regex patterns wherever I could find them in my codebase - or at least those that are used on user input. The examples given at the OWASP link above and wikipedia are helpful, but they don't do a great job of explaining the problem in simple terms.

A description of evil regexes, from wikipedia:

the regular expression applies repetition ("+", "*") to a complex subexpression;
for the repeated subexpression, there exists a match which is also a suffix of another valid match.

With examples, again from wikipedia:

(a+)+
([a-zA-Z]+)*
(a|aa)+
(a|a?)+
(.*a){x} for x > 10

Is this a problem that just doesn't have a simpler explanation? I'm looking for something that would make it easier to avoid this problem while writing regexes, or to find them within an existing codebase.

277

asked Oct 11 '12 14:10

Mike Partridge

2 Answers

Why Are Evil Regexes A Problem?

Because computers do exactly what you tell them to do, even if it's not what you meant or is totally unreasonable. If you ask a regex engine to prove that, for some given input, there either is or is not a match for a given pattern, then the engine will attempt to do that no matter how many different combinations must be tested.

Here is a simple pattern inspired by the first example in the OP's post:

^((ab)*)+$

Given the input:

abababababababababababab

The regex engine tries something like (abababababababababababab) and a match is found on the first try.

But then we throw the monkey wrench in:

abababababababababababab a

The engine will first try (abababababababababababab) but that fails because of that extra a. This causes catastrophic backtracking, because our pattern (ab)*, in a show of good faith, will release one of its captures (it will "backtrack") and let the outer pattern try again. For our regex engine, that looks something like this:

(abababababababababababab) - Nope
(ababababababababababab)(ab) - Nope
(abababababababababab)(abab) - Nope
(abababababababababab)(ab)(ab) - Nope
(ababababababababab)(ababab) - Nope
(ababababababababab)(abab)(ab) - Nope
(ababababababababab)(ab)(abab) - Nope
(ababababababababab)(ab)(ab)(ab) - Nope
(abababababababab)(abababab) - Nope
(abababababababab)(ababab)(ab) - Nope
(abababababababab)(abab)(abab) - Nope
(abababababababab)(abab)(ab)(ab) - Nope
(abababababababab)(ab)(ababab) - Nope
(abababababababab)(ab)(abab)(ab) - Nope
(abababababababab)(ab)(ab)(abab) - Nope
(abababababababab)(ab)(ab)(ab)(ab) - Nope
(ababababababab)(ababababab) - Nope
(ababababababab)(abababab)(ab) - Nope
(ababababababab)(ababab)(abab) - Nope
(ababababababab)(ababab)(ab)(ab) - Nope
(ababababababab)(abab)(abab)(ab) - Nope
(ababababababab)(abab)(ab)(abab) - Nope
(ababababababab)(abab)(ab)(ab)(ab) - Nope
(ababababababab)(ab)(abababab) - Nope
(ababababababab)(ab)(ababab)(ab) - Nope
(ababababababab)(ab)(abab)(abab) - Nope
(ababababababab)(ab)(abab)(ab)(ab) - Nope
(ababababababab)(ab)(ab)(ababab) - Nope
(ababababababab)(ab)(ab)(abab)(ab) - Nope
(ababababababab)(ab)(ab)(ab)(abab) - Nope
(ababababababab)(ab)(ab)(ab)(ab)(ab) - Nope
...
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab) - Nope

The number of possible combinations scales exponentially with the length of the input and, before you know it, the regex engine is eating up all your system resources trying to solve this thing until, having exhausted every possible combination of terms, it finally gives up and reports "There is no match." Meanwhile your server has turned into a burning pile of molten metal.

How to Spot Evil Regexes

It's actually very tricky. Catastrophic backtracking in modern regex engines is similar in nature to the halting problem which Alan Turing proved was impossible to solve. I have written problematic regexes myself, even though I know what they are and generally how to avoid them. Wrapping everything you can in an atomic group can help to prevent the backtracking issue. It basically tells the regex engine not to revisit a given expression - "lock whatever you matched on the first try". Note, however, that atomic expressions don't prevent backtracking within the expression, so ^(?>((ab)*)+)$ is still dangerous, but ^(?>(ab)*)+$ is safe (it'll match (abababababababababababab) and then refuse to give up any of it's matched characters, thus preventing catastrophic backtracking).

Unfortunately, once it's written, it's actually very hard to immediately or quickly find a problem regex. In the end, recognizing a bad regex is like recognizing any other bad code - it takes a lot of time and experience and/or a single catastrophic event.

Interestingly, since this answer was first written, a team at the University of Texas at Austin published a paper describing the development of a tool capable of performing static analysis of regular expressions with the express purpose of finding these "evil" patterns. The tool was developed to analyse Java programs, but I suspect that in the coming years we'll see more tools developed around analysing and detecting problematic patterns in JavaScript and other languages, especially as the rate of ReDoS attacks continues to climb.

Static Detection of DoS Vulnerabilities in Programs that use Regular Expressions
Valentin Wüstholz, Oswaldo Olivo, Marijn J. H. Heule, and Isil Dillig
The University of Texas at Austin

155

answered Sep 28 '22 22:09

JDB

What you call an "evil" regex is a regex that exhibits catastrophic backtracking. The linked page (which I wrote) explains the concept in detail. Basically, catastrophic backtracking happens when a regex fails to match and different permutations of the same regex can find a partial match. The regex engine then tries all those permutations. If you want to go over your code and inspect your regexes these are the 3 key issues to look at:

Alternatives must be mutually exclusive. If multiple alternatives can match the same text then the engine will try both if the remainder of the regex fails. If the alternatives are in a group that is repeated, you have catastrophic backtracking. A classic example is (.|\s)* to match any amount of any text when the regex flavor does not have a "dot matches line breaks" mode. If this is part of a longer regex then a subject string with a sufficiently long run of spaces (matched by both . and \s) will break the regex. The fix is to use (.|\n)* to make the alternatives mutually exclusive or even better to be more specific about which characters are really allowed, such as [\r\n\t\x20-\x7E] for ASCII printables, tabs, and line breaks.
Quantified tokens that are in sequence must either be mutually exclusive with each other or be mutually exclusive what comes between them. Otherwise both can match the same text and all combinations of the two quantifiers will be tried when the remainder of the regex fails to match. A classic example is a.*?b.*?c to match 3 things with "anything" between them. When c can't be matched the first .*? will expand character by character until the end of the line or file. For each expansion the second .*? will expand character by character to match the remainder of the line or file. The fix is to realize that you can't have "anything" between them. The first run needs to stop at b and the second run needs to stop at c. With single characters a[^b]*+b[^c]*+c is an easy solution. Since we now stop at the delimiter, we can use possessive quantifiers to further increase performance.
A group that contains a token with a quantifier must not have a quantifier of its own unless the quantified token inside the group can only be matched with something else that is mutually exclusive with it. That ensures that there is no way that fewer iterations of the outer quantifier with more iterations of the inner quantifier can match the same text as more iterations of the outer quantifier with fewer iterations of the inner quantifier. This is the problem illustrated in JDB's answer.

While I was writing my answer I decided that this merited a full article on my website. This is now online too.

answered Sep 28 '22 22:09

Jan Goyvaerts

Related questions
                            
                                Regex for converting CamelCase to camel_case in java
                            
                                Regular Expression Match to test for a valid year
                            
                                How to check that a string is a palindrome using regular expressions?
                            
                                Regular expression containing one word or another
                            
                                JavaScript - string regex backreferences
                            
                                Using sed and grep/egrep to search and replace
                            
                                grepping using the "|" alternative operator
                            
                                Which regular expression operator means 'Don't' match this character?
                            
                                Javascript regular expression: remove first and last slash
                            
                                Python regex - r prefix
                            
                                Get the index of a pattern in a string using regex
                            
                                Remove part of a string
                            
                                Is there a difference between /\s/g and /\s+/g?
                            
                                Validate email address in Dart? [duplicate]
                            
                                Is it possible for a computer to "learn" a regular expression by user-provided examples?
                            
                                Getting the text that follows after the regex match
                            
                                How do I do a case insensitive regular expression in Go?
                            
                                A regex for version number parsing
                            
                                Number of occurrences of a character in a string [duplicate]
                            
                                How do you validate a URL with a regular expression in Python?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With