Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

RegExp and String combination crashes Chrome

I have the following RegExp to validate an email address:

^[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})$

Running it on a basic email works beautifully:

/^[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})$/.test('[email protected]');

But running it on a long string crashes Chrome:

/^[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})$/.test('dddddddddddddddddddddddddddddddddddddddd');

I've noticed it kicks in at around 40 characters

What is it about this RegExp which is so intensive?

like image 375
Dave Taylor Avatar asked Oct 09 '12 15:10

Dave Taylor


2 Answers

OK, let's see what's happening here. You have thankfully already simplified the problem down to what happens when you apply the regex

(d+)*@

to the string

ddddd

which clearly can't be matched because the @ is missing. But what's keeping the regex engine from figuring this out quickly?

Well, (d+)* in essence can be fulfilled by the following matches (each group separated by spaces):

ddddd
dddd d
ddd dd
dd ddd
d dddd
ddd d d
dd d dd
d d ddd
dd dd d
d dd dd
d ddd d
d d d dd
d d dd d
d dd d d
dd d d d
d d d d d

So you have one way of matching the string without breaking up the string, four ways to break it up into two strings, six ways to break it up into three, four to break it into four, and one to break it up into five strings. All these combinations have to be checked by the regex engine before it can declare failure to match when it finally has to face the following @.

Why doesn't it figure that out sooner? Well, some regex engines can probably do it with such a simplified example. I bet Larry Wall has that covered. But your actual regex is a bit more complex, so I guess that would be much harder to figure out beforehand. Plus, there is a simple way to ensure all this combination-trying doesn't happen. But I'll come back to that later.

I had been wondering how many such combinations there would be for a longer string than those puny five ds:

Let's take a string of length m (which can be cut apart in m-1 different positions). Let's say n = m-1. Then you can calculate the number of combinations as follows:

1 + (n!/(1! * (n-1)!)) + (n!/(2! * (n-2)!)) + ... + (n!/(n! * (n-n)!))

The first and last items are always 1, but the items in-between can get quite large. Let's write a small Python program:

>>> from math import factorial as f
>>> def comb(n,x):
...    return f(n) // (f(x) * f(n-x))
...
>>> def ways_to_split(len):
...    return 1 + sum(comb(len-1,x) for x in range(1,len))
...
>>> ways_to_split(5)
16

OK, seems to work. Let's try something bigger:

>>> ways_to_split(10)
512
>>> ways_to_split(40)
549755813888
>>> ways_to_split(100)
633825300114114700748351602688

Hey, there's a pattern here: ways_to_split(n) is equal to 2**(n-1). See Mathematics SE for a proof. Anyway. Your regex has O(2^n) complexity. Well, now you see why that might take a while...

Thankfully, many regex engines provide protection against this: possessive quantifiers or atomic capturing groups.

(d++)*

or

(?>d+)*

both ensure that whatever d+ has matched will not be relinquished again for trying other combinations. Good news, right?

Well, not if you use JavaScript. It doesn't support either of those features.

Therefore, you either need to rewrite your regex not to be susceptible to backtracking like this:

Instead of

(([_\.\-]?[a-zA-Z0-9]+)*)

use

[a-zA-Z0-9]+([_.-][a-zA-Z0-9]+)*

Or stop trying to use a regex to validate an e-mail address which doesn't work reliably, ever, anyway.

like image 119
Tim Pietzcker Avatar answered Oct 23 '22 03:10

Tim Pietzcker


I think it is related to your regex not the length of string. I used the following regex for email validation and it didnt crash on Chrome..

/^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/.test('dddddddddddddddddddddddddddddddddddddddd');
like image 2
scusyxx Avatar answered Oct 23 '22 03:10

scusyxx