Is there any way to put malicious code into a regular expression?

Denial‐of‐Service Concerns

The most common concern with regexes is a denial‐of‐service attack through pathological patterns that go exponential — or even super‐exponential! — and so appear to take forever to solve. These may only show up on particular input data, but one can generally create one wherein this doesn’t matter.

Which ones these are will depend somewhat on how smart the regex compiler you’re using happens to be, because some of these can be detected during compilation time. Regex compilers that implement recursion usually have a built‐in recursion‐depth counter for checking non‐progression.

Russ Cox’s excellent 2007 paper on Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...) talks about ways that most modern NFAs, which all seem to derive from Henry Spencer’s code, suffer severe performance degradation, but where a Thompson‐style NFA has no such problems.

If you only admit patterns that can be solved by DFAs, you can compile them up as such, and they will run faster, possibly much faster. However, it takes time to do this. The Cox paper mentions this approach and its attendant issues. It all comes down to a classic time–space trade‐off.

With a DFA, you spend more time building it (and allocating more states), whereas with an NFA you spend more time executing it, since it can be multiple states at the same time, and backtracking can eat your lunch — and your CPU.

Denial‐of‐Service Solutions

Probably the most reasonable way to address these patterns that are on the losing end of a race with the heat‐death of the universe is to wrap them with a timer that effectively places a maximum amount of time allowed for their execution. Usually this will be much, much less than the default timeout that most HTTP servers provide.

There are various ways to implement these, ranging form a simple alarm(N) at the C level, to some sort of try {} block the catches alarm‐type exceptions, all the way to spawning off a new thread that’s specially created with a timing constraint built right into it.

Code Callouts

In regex languages that admit code callouts, some mechanism for allowing or disallowing these from the string you’re going to compile should be provided. Even if code callouts are only to code in the language you are using, you should restrict them; they don’t have to be able to call external code, although if they can, you’ve got much bigger problems.

For example, in Perl one cannot have code callouts in regexes created from string interpolation (as these would be, as they’re compiled during run‐time) unless the special lexically‐scoped pragma use re "eval"; in active in the current scope.

That way nobody can sneak in a code callout to run system programs like rm -rf *, for example. Because code callouts are so security‐sensitive, Perl disables them by default on all interpolated strings, and you have to go out of your way to re‐enable them.

User‐Defined \P{roperties}

There remains one more security‐sensitive issue related to Unicode-style properties — like \pM, \p{Pd}, \p{Pattern_Syntax}, or \p{Script=Greek} — that may exist in some regex compilers that support that notation.

The issue is that in some of these, the set of possible properties is user‐extensible. That means you can have custom properties that are actual code callouts to named functions in some particular namepace, like \p{GoodChars} or \p{Class::Good_Characters}. How your language handles those might be worth looking at.

Sandboxing

In Perl, a sandboxed compartment via the Safe module would give control over namespace visibility. Other languages offer similar sandboxing technologies. If such devices are available, you might want to look into them, because they are specifically designed for limited execution of untrusted code.

Adding to tchrist's excellent answer: the same Russ Cox who wrote the "Regular Expression" page has also released code! re2 is a C++ library which guarantees O(length_of_regex) runtime and configurable memory-use limit. It's used within Google so that you can type a regex into google code search -- meaning that it's been battle tested.

Yes.

Regexes can be used to perform DOS attacks.
There is no simple solution.

You'll want to read this paper:

Insecure Context Switching: Inoculating regular expressions for survivability The paper is more about what can go wrong with regular expression engines (e.g. PCRE), but it may help you understand what you're up against.

Related questions
                            
                                Regex to replace everything except numbers and a decimal point
                            
                                Validating IPv4 addresses with regexp
                            
                                Get string between two strings in a string
                            
                                Regex to get string between curly braces
                            
                                Which is the correct shorthand - "regex" or "regexp" [closed]
                            
                                Regular Expression: Any character that is NOT a letter or number
                            
                                Why does Javascript's regex.exec() not always return the same value? [duplicate]
                            
                                Can't escape the backslash with regex?
                            
                                How does this giant regex work?
                            
                                Search and Replace with RegEx components in Atom editor
                            
                                What is the meaning of the 'g' flag in regular expressions?
                            
                                Can I use an OR in regex without capturing what's enclosed?
                            
                                How would I get everything before a : in a string Python
                            
                                deny direct access to a folder and file by htaccess
                            
                                Decimal number regular expression, where digit after decimal is optional
                            
                                Regular Expression For Duplicate Words
                            
                                Use dynamic (variable) string as regex pattern in JavaScript
                            
                                Regular expression to match a word or its prefix
                            
                                Split string to equal length substrings in Java

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Is there any way to put malicious code into a regular expression?

Tags:

regex

security

People also ask