Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why Regex IsMatch() hangs

Tags:

c#

.net

regex

I have an exepression to validate an email address:

string REGEX_EMAIL = @"^\w+([\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*\w+)*@\w+([\.\-]\w+)*\.\w+([\.\-]\w+)*$";

If address is correct IsMatch() method quickly shows true result. But if address string is long and wrong this method hangs.

What can I do to raise speed of this method?

Thanks.

like image 390
Vyacheslav Avatar asked Nov 29 '11 19:11

Vyacheslav


2 Answers

You are experiencing catastrophic backtracking.

The simplified regex:

Regex regexObj = new Regex(@"^\w+([-.!#$%&'*+/=?^`{¦}~]*\w+)*@\w+([.-]\w+)*\.\w+([.-]\w+)*$");

Has potential problems e.g. ([.-]\w+)*\.

If the . is missing for example and you have a long string of characters before it then all possible combinations must be taken into account for your regex to find out that it actually fails.

like image 100
FailedDev Avatar answered Nov 13 '22 02:11

FailedDev


You have a couple things going on which are hurting the performance in this regular expression.

  1. Catastrophic backtracking
  2. Too many optional statements

You can definitely improve performance by using the + instead of the * in a few key places, but this of course changes what the regular expression will and won't match. So the easiest fix I found is actually covered in the catastrophic backtracking article above. You can use the nonbacktracking subexpression to drastically improve performance in this case, without changing the regular expression's behavior in any way that matters.

The nonbacktracking subexpression looks like this... (?>pattern)

So try this regular expression instead:

^\w+(?>[\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*\w+)*@\w+([\.\-]\w+)*\.\w+([\.\-]\w+)*$

On a slightly related topic, my philosophy for checking for a valid email address is a bit different. For one, long regular expressions like this one can potentially have performance problems as you've found.

Secondly, there's the upcoming promise of email address internationalization which complicates all of this even more.

Lastly, the main purpose of any regular expression based email validation is to catch typos and blatant attempts to get through your form without entering a real email address. But to check if an email address is genuine requires that you send an email to that address.

So my philosophy is to err on the side of accepting too much. And that, in fact, is a very simple thing to do...

^.+@.+\..+$

This should match any conceivably valid email address, and some invalid ones as well.

like image 4
Steve Wortham Avatar answered Nov 13 '22 00:11

Steve Wortham