
What's wrong with this RegEx for validating emails?

Here's a regex for validating emails: \S+@\S+\.\S+. I didn't write it; I'm new to regular expressions and don't understand them all that well.

I have a couple of questions:

  1. What's wrong with the above RegEx?
  2. What's a good RegEx for validating emails?
Tesseract asked Jul 02 '09 20:07



2 Answers

"How do I validate an email with regex" is one of the most popular questions about regular expressions, and the only really good answer is "you don't". It has been discussed on this very website on many occasions. What you have to understand is that if you really wanted to follow the spec, your regex would look something like this. Obviously that is a monstrosity, and it is more an exercise in demonstrating how ridiculously difficult it is to accept everything you are supposed to be able to accept.

With that in mind, if you absolutely, positively need to know that the email address is valid, the only real way to check is to actually send a message to the address and see whether it bounces. Otherwise, this regex will properly validate most cases, and in a lot of situations most cases is enough. In addition, that page discusses the problems with trying to validate emails with regex.
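For illustration only, here is a minimal sketch in Python of a deliberately loose check that avoids regex altogether. The function name `looks_like_email` and its rules (something before the last @, a dotted domain with no empty labels) are assumptions made for this example, not part of any spec. It rejects some technically valid addresses such as user@localhost, and passing it says nothing about whether mail can actually be delivered.

```python
def looks_like_email(address: str) -> bool:
    """Loose structural check (hypothetical helper, not RFC-compliant).

    Only a confirmation message can show that an address really works.
    """
    # Split on the LAST '@' so an '@' inside the local part is tolerated.
    local, sep, domain = address.rpartition("@")
    if not sep or not local:
        return False  # no '@' at all, or nothing before it

    # Assumed rule: the domain has at least one dot and no empty labels.
    labels = domain.split(".")
    return len(labels) >= 2 and all(labels)


if __name__ == "__main__":
    for candidate in ("user@example.com", "no-at-sign", "user@example..com"):
        print(candidate, looks_like_email(candidate))
```

A check like this is only useful for catching obvious typos before the confirmation message is sent.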

Paolo Bergantino answered Sep 20 '22 18:09



I'm only going to answer your first question, and from a technical regex point of view.

What is wrong with the regex \S+@\S+\.\S+ is that it can execute far too slowly. What happens if somebody enters an email string like the one below, and you need to validate it?

[email protected]

Or even worse (yes, that's 100 @'s after the dot):

@.@@@@@@@@@@@@@@@@@@@@@@@@@ \ @@@@@@@@@@@@@@@@@@@@@@@@@ \ @@@@@@@@@@@@@@@@@@@@@@@@@ \ @@@@@@@@@@@@@@@@@@@@@@@@@

Slowness happens. First, the regex greedily matches as many characters as possible for the first \S+, so it initially matches the whole string. Then it needs the @ character, so it backtracks until it finds one. At that point we've got another \S+, so again it consumes everything until the end of the string. Then it needs to backtrack again until it finds a dot. Can you imagine how much backtracking occurs before the regex finally fails on the second email string?
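To see the effect for yourself, here is a small sketch that times the original regex against inputs shaped like the second example string. It uses Python's `re` module, which is my choice and not something the question specifies. The helper name `hostile_input` and the sizes tried are assumptions for illustration. The timings depend on the machine and the regex engine, so none are quoted here.

```python
import re
import timeit

# The regex from the question, unchanged.
NAIVE = re.compile(r"\S+@\S+\.\S+")


def hostile_input(n: int) -> str:
    """Build '@.' followed by n '@' characters (hypothetical helper)."""
    return "@." + "@" * n


if __name__ == "__main__":
    for n in (50, 100, 200):
        text = hostile_input(n)
        # Average over a few runs; the search never succeeds on this input,
        # so the engine has to exhaust its backtracking options first.
        seconds = timeit.timeit(lambda: NAIVE.search(text), number=5) / 5
        matched = NAIVE.search(text) is not None
        print(f"n={n:4d}  matched={matched}  {seconds * 1e3:.3f} ms per search")
```

If the time per search grows much faster than the input length, that is the backtracking described above showing up in practice.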

To kill the backtracking, I suggest using possessive quantifiers on negated character classes in this case, which have the additional benefit of not allowing multiple @'s in one string.

[^@\s]++@[^@\s.]++\.[^@\s]++

I did a quick benchmark for the two regexes against the “100 @'s email”. Mine is about 95 times faster.
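As a rough sketch of how the possessive version could be used, the Python below compiles it and falls back to an ordinary greedy variant when the engine rejects `++` (Python's `re` accepts possessive quantifiers only from 3.11 on). The fallback pattern and the name `has_email_shape` are assumptions for this example. So is the use of `fullmatch`: the "no multiple @'s" property only holds when the whole string must match, since an unanchored search could still find a valid-looking substring.

```python
import re

# The possessive regex from the answer above.
POSSESSIVE = r"[^@\s]++@[^@\s.]++\.[^@\s]++"
# Assumed fallback for engines without possessive quantifiers. Each class
# excludes the character that must follow it, so a whole-string match gives
# the same result, just without the explicit no-backtracking guarantee.
GREEDY_FALLBACK = r"[^@\s]+@[^@\s.]+\.[^@\s]+"

try:
    EMAIL_SHAPE = re.compile(POSSESSIVE)
except re.error:
    EMAIL_SHAPE = re.compile(GREEDY_FALLBACK)


def has_email_shape(text: str) -> bool:
    """True if the WHOLE string has the shape local@domain.rest."""
    return EMAIL_SHAPE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("user@example.com", "a@b@example.com", "@." + "@" * 100):
        print(repr(candidate[:20]), has_email_shape(candidate))
```

Like the original, this only checks the rough shape of the string; it does not show that the address exists.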

Geert answered Sep 18 '22 18:09
