Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex for a username increases CPU consumption

Tags:

java

regex

We have the following rules for our username validation:

  • Username can have alphanumeric characters
  • Username can have an underscore, hyphen or a period
  • For now assume the username is in ASCII
  • Username cannot start or end with a period
  • Username cannot start, end or have any spaces

We have the following regex for the same:

^(([a-zA-Z0-9]+[_-]*[a-zA-Z0-9]*)([\\.]*[a-zA-Z0-9])*)+$

Now trying to match a particular String, the CPU usage grows exponentially. For e.g:

[email protected]

Obviously, matching a String like above should fail instantly but I want to know why is it taking that much CPU cycles. Final code:

import java.util.regex.*;
public class R {
    static final Pattern namePattern = Pattern.compile("^(([a-zA-Z0-9]+[_-]*[a-zA-Z0-9]*)([\\.]*[a-zA-Z0-9])*)+$");

    public static void main(String... args) {
        final String userName = "[email protected]";
        Matcher matcher = namePattern.matcher(userName);
        System.out.println(matcher.matches());
    }
}

On a different lines, I rewrote the regex like below and it fairs well:

^[\\w]+[\\w-_\\.]*[\\w]+$
like image 806
user320599 Avatar asked Apr 29 '13 20:04

user320599


1 Answers

When regular expression engines use backtracking a lot, the process of matching becomes very slow. Backtracking happens a lot when you let different parts of your expression match overlapping parts of your input, especially when there is no match: the engine tries different possibilities of splitting up the input among the portions of your regular expression.

Consider this simple example from your regex: let's use [a-zA-Z0-9]+[_-]*[a-zA-Z0-9]* to match M45766235H. Note that there are two sub-expressions that can cover all characters starting with the second one: the engine must consider all these possibilities:

[a-zA-Z0-9]+ [a-zA-Z0-9]*
------------ ------------
M45766235H   <nothing>
M45766235    H
M4576623     5H
M457662      35H
M45766       235H
M4576        6235H
M457         66235H
M45          766235H
M4           5766235H
M            45766235H

Given that there is no match, that's ten useless repetitions right there. But that's not the end of it! When there are other sub-expressions in the mix that could produce ambiguous coverage, these ten possible matches will be tried for each of the possible matches further on in the string.

To say that the effects of backtracking add up quickly would be an understatement: they multiply in geometric progression, eventually consuming a lot of your CPU.

The moral of this story is to try and reduce the amount of backtracking. For example, your second expression

^[\\w]+[\\w-_\\.]*[\\w]+$

can be rewritten like this:

^\\w[\\w-_\\.]*\\w$

The two expressions will match the same set of inputs, but when there is a match, the second expression will have a unique match, while the original expression would have roughly (N-2)^3 different ways of splitting the matched string among the three sub-expressions that match word characters.

Here is some more reading on a related subject: Performance of Greedy vs. Lazy Quantifiers.

like image 74
Sergey Kalinichenko Avatar answered Sep 22 '22 15:09

Sergey Kalinichenko