We have the following rules for our username validation:
We have the following regex for the same:
^(([a-zA-Z0-9]+[_-]*[a-zA-Z0-9]*)([\\.]*[a-zA-Z0-9])*)+$
Now trying to match a particular String, the CPU usage grows exponentially. For e.g:
[email protected]
Obviously, matching a String like above should fail instantly but I want to know why is it taking that much CPU cycles. Final code:
import java.util.regex.*;
public class R {
static final Pattern namePattern = Pattern.compile("^(([a-zA-Z0-9]+[_-]*[a-zA-Z0-9]*)([\\.]*[a-zA-Z0-9])*)+$");
public static void main(String... args) {
final String userName = "[email protected]";
Matcher matcher = namePattern.matcher(userName);
System.out.println(matcher.matches());
}
}
On a different lines, I rewrote the regex like below and it fairs well:
^[\\w]+[\\w-_\\.]*[\\w]+$
When regular expression engines use backtracking a lot, the process of matching becomes very slow. Backtracking happens a lot when you let different parts of your expression match overlapping parts of your input, especially when there is no match: the engine tries different possibilities of splitting up the input among the portions of your regular expression.
Consider this simple example from your regex: let's use [a-zA-Z0-9]+[_-]*[a-zA-Z0-9]*
to match M45766235H.
Note that there are two sub-expressions that can cover all characters starting with the second one: the engine must consider all these possibilities:
[a-zA-Z0-9]+ [a-zA-Z0-9]*
------------ ------------
M45766235H <nothing>
M45766235 H
M4576623 5H
M457662 35H
M45766 235H
M4576 6235H
M457 66235H
M45 766235H
M4 5766235H
M 45766235H
Given that there is no match, that's ten useless repetitions right there. But that's not the end of it! When there are other sub-expressions in the mix that could produce ambiguous coverage, these ten possible matches will be tried for each of the possible matches further on in the string.
To say that the effects of backtracking add up quickly would be an understatement: they multiply in geometric progression, eventually consuming a lot of your CPU.
The moral of this story is to try and reduce the amount of backtracking. For example, your second expression
^[\\w]+[\\w-_\\.]*[\\w]+$
can be rewritten like this:
^\\w[\\w-_\\.]*\\w$
The two expressions will match the same set of inputs, but when there is a match, the second expression will have a unique match, while the original expression would have roughly (N-2)^3
different ways of splitting the matched string among the three sub-expressions that match word characters.
Here is some more reading on a related subject: Performance of Greedy vs. Lazy Quantifiers.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With