Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there any way to put malicious code into a regular expression?

Tags:

regex

security

People also ask

What is regex injection?

Regex Injection. Regex Injection. Regular expressions (regex) are a way of describing the order and type of characters that occur in a string. They are often used to validate input or search for “wildcard” matches within a set of strings.

Should I use regex in code?

I would always use a regex unless it's something very simple such as splitting a comma-separated string. If I think there's a chance the strings might one day get more complicated, I'll probably start with a regex.

What is regex in cyber security?

At its most basic, a regular expression (or "regex") is just a string that describes a pattern to be matched. For example, imagine a program scanning lines in one or more files, looking for lines that contain the regular expression pattern of interest. When it finds a line with that pattern, it prints that line out.


Denial‐of‐Service Concerns

The most common concern with regexes is a denial‐of‐service attack through pathological patterns that go exponential — or even super‐exponential! — and so appear to take forever to solve. These may only show up on particular input data, but one can generally create one wherein this doesn’t matter.

Which ones these are will depend somewhat on how smart the regex compiler you’re using happens to be, because some of these can be detected during compilation time. Regex compilers that implement recursion usually have a built‐in recursion‐depth counter for checking non‐progression.

Russ Cox’s excellent 2007 paper on Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...) talks about ways that most modern NFAs, which all seem to derive from Henry Spencer’s code, suffer severe performance degradation, but where a Thompson‐style NFA has no such problems.

If you only admit patterns that can be solved by DFAs, you can compile them up as such, and they will run faster, possibly much faster. However, it takes time to do this. The Cox paper mentions this approach and its attendant issues. It all comes down to a classic time–space trade‐off.

With a DFA, you spend more time building it (and allocating more states), whereas with an NFA you spend more time executing it, since it can be multiple states at the same time, and backtracking can eat your lunch — and your CPU.

Denial‐of‐Service Solutions

Probably the most reasonable way to address these patterns that are on the losing end of a race with the heat‐death of the universe is to wrap them with a timer that effectively places a maximum amount of time allowed for their execution. Usually this will be much, much less than the default timeout that most HTTP servers provide.

There are various ways to implement these, ranging form a simple alarm(N) at the C level, to some sort of try {} block the catches alarm‐type exceptions, all the way to spawning off a new thread that’s specially created with a timing constraint built right into it.

Code Callouts

In regex languages that admit code callouts, some mechanism for allowing or disallowing these from the string you’re going to compile should be provided. Even if code callouts are only to code in the language you are using, you should restrict them; they don’t have to be able to call external code, although if they can, you’ve got much bigger problems.

For example, in Perl one cannot have code callouts in regexes created from string interpolation (as these would be, as they’re compiled during run‐time) unless the special lexically‐scoped pragma use re "eval"; in active in the current scope.

That way nobody can sneak in a code callout to run system programs like rm -rf *, for example. Because code callouts are so security‐sensitive, Perl disables them by default on all interpolated strings, and you have to go out of your way to re‐enable them.

User‐Defined \P{roperties}

There remains one more security‐sensitive issue related to Unicode-style properties — like \pM, \p{Pd}, \p{Pattern_Syntax}, or \p{Script=Greek} — that may exist in some regex compilers that support that notation.

The issue is that in some of these, the set of possible properties is user‐extensible. That means you can have custom properties that are actual code callouts to named functions in some particular namepace, like \p{GoodChars} or \p{Class::Good_Characters}. How your language handles those might be worth looking at.

Sandboxing

In Perl, a sandboxed compartment via the Safe module would give control over namespace visibility. Other languages offer similar sandboxing technologies. If such devices are available, you might want to look into them, because they are specifically designed for limited execution of untrusted code.


Adding to tchrist's excellent answer: the same Russ Cox who wrote the "Regular Expression" page has also released code! re2 is a C++ library which guarantees O(length_of_regex) runtime and configurable memory-use limit. It's used within Google so that you can type a regex into google code search -- meaning that it's been battle tested.


Yes.

Regexes can be used to perform DOS attacks.
There is no simple solution.


You'll want to read this paper:

Insecure Context Switching: Inoculating regular expressions for survivability The paper is more about what can go wrong with regular expression engines (e.g. PCRE), but it may help you understand what you're up against.