
Why is jQuery's email validation regex so simple?

We all know that a regex to validate emails properly would be quite complicated. However, jQuery's validation plugin has a shorter regex (contributed by Scott Gonzalez), spanning only a few lines:

/^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|
((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|
[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]
|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/

Why is this so 'simple' compared to the more well-known monstrosity? Are there cases where one regex would fail and the other would succeed (whether the cases are valid or invalid emails)?

configurator asked Dec 01 '10 02:12


2 Answers

The regex is a custom combination of:

  • RFC 2234 ABNF
  • RFC 2396 URI Generic Syntax (obsoleted by RFC 3986)
  • RFC 2616 Hypertext Transfer Protocol -- HTTP/1.1
  • RFC 2822 Internet Message Format
  • RFC 3987 IRI
  • RFC 3986 URI Generic Syntax

I wrote the regex when Web Forms 2.0 was being drafted and RFC 5322 did not exist. If you look at the order in which the RFCs were written, you'll notice that the definition for IRI and URI changed after Internet Message Format was written. This means that RFC 2822 does not support current IRI definitions. Unfortunately, it wasn't a simple task of just substituting definitions, so I had to pick and choose which definitions to use from which RFCs. I also made choices about what to remove (like support for comments).

The regex is not fully hand-written. While I did manually write every section of the regex, I scripted the "glue". Each definition from the RFCs is stored in a variable, with compound definitions utilizing the variables that store the simpler definitions (@Walf: this is why there are so many subpatterns and ors).
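The variable-based "glue" described above might look roughly like the following. This is a minimal sketch, not the actual generator script: the fragment names (`alnum`, `atext`, `dotAtom`, `label`, `domain`) are assumptions borrowed loosely from RFC terminology, and the grammar is heavily simplified (no quoted strings, no Unicode ranges).

```javascript
// Illustrative only: fragment names and the simplified grammar are assumptions,
// not the definitions used to generate the jQuery Validation regex.
const alnum = "[a-z\\d]";
const atext = "(?:" + alnum + "|[!#$%&'*+\\-/=?^_`{|}~])";

// Compound definitions reuse the simpler ones, which is where the
// many nested groups and alternations in the final pattern come from.
const dotAtom = atext + "+(?:\\." + atext + "+)*";
const label = alnum + "(?:(?:" + alnum + "|-)*" + alnum + ")?";
const domain = label + "(?:\\." + label + ")+";

// Glue the pieces into one anchored, case-insensitive pattern.
const addrSpec = new RegExp("^" + dotAtom + "@" + domain + "$", "i");

console.log(addrSpec.source);
console.log(addrSpec.test("user.name@example.com"));
console.log(addrSpec.test("user..name@example.com"));
```

Printing `addrSpec.source` shows how quickly even a handful of small definitions expands into a long pattern full of subpatterns.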

To complicate the matter, the version of the regex that is used in the jQuery Validation plugin is modified even further to account for differences between spec-valid addresses and user expectation of a valid address. I have no recollection of what modifications I made. I promised Jörn Zaefferer (the author of the validation plugin) that I would write a newer script to generate the regex. The new script would allow you to specify options for what you do and don't want to support (required TLD, specific TLDs, IPv6, comments, obsolete definitions, quoted local names, etc.). That was 5 years ago. I started it once, but never finished. Maybe one day I will. What I have so far is hosted on GitHub: https://github.com/scottgonzalez/regex-builder

If you want a regex for validating email addresses, I'd suggest the following regex which is included in the HTML5 specification:

/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
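A minimal usage sketch for the pattern above. The helper name `isValidEmail` and the sample addresses are assumptions made for illustration; only the regex itself comes from the answer.

```javascript
// Pattern copied from the answer; isValidEmail is a hypothetical helper name.
const html5Email = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

function isValidEmail(value) {
  // Guard against non-string input before testing the pattern.
  return typeof value === "string" && html5Email.test(value);
}

const samples = [
  "user@example.com",
  "user@localhost",        // dotless domain
  "a..b@example.com",      // consecutive dots in the local part
  "user@-example.com",     // label starting with a hyphen
  "J\u00F6rn@example.com", // non-ASCII local part
];

for (const sample of samples) {
  console.log(sample, isValidEmail(sample));
}
```

The first three samples match and the last two do not: the pattern does not require a TLD and places no restriction on dots in the local part, but it rejects labels that start with a hyphen and any non-ASCII character.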

If you use regex-builder and turn off all the options, you'll get something similar. But it's been about a year since I looked at that, so I don't remember what the differences are.


I'd also like to point out that the link in the original question specifically mentions RFC 822. While it's great that RFC 822 advanced us from Arpanet to the ARPA Internet, this isn't exactly current. The Internet has made a few advances in the past three decades and this RFC has been superseded twice. I'd like to see any new work following the latest standards.


UPDATE:

A friend asked me why the HTML5 regex doesn't support UTF-8. I've never asked Hixie about it, but I assume this is the reason: Even though some TLDs started to support IDNs (Internationalized Domain Names) in 2000 and RFC 3987 (IRI) was written in 2005, when RFC 5322 was written in 2008 it only listed characters in the ranges 33-90 and 94-126 as valid dtext (characters allowed for use in a domain literal). HTML5 is based on RFC 5322 and as a result there is no UTF-8 support. It certainly seems strange that RFC 5322 doesn't account for IDNs, but it's worth noting that even in 2008 IDNs weren't actually usable. It wasn't until 2010 that ICANN approved the first set of IDNs. However, even today if you want to use an IDN, you pretty much need to completely destroy your domain name using Punycode if you actually want things like email and DNS to work globally.
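To make the Punycode point concrete, here is a rough sketch that converts only the domain part of an address to its ASCII form before testing it against the HTML5 pattern. The helper name `toAsciiAddress` and the sample address are assumptions, and relying on the standard `URL` parser (available in browsers and Node.js) for the IDNA conversion is a convenience chosen for this example, not something the answer prescribes.

```javascript
const html5Pattern = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Hypothetical helper: rewrites the domain part in ASCII (Punycode) form.
// The local part is left untouched.
function toAsciiAddress(address) {
  const at = address.lastIndexOf("@");
  if (at < 1) return null; // no "@" or empty local part
  const local = address.slice(0, at);
  const domain = address.slice(at + 1);
  try {
    // The URL parser applies IDNA processing to the host name.
    return local + "@" + new URL("http://" + domain).hostname;
  } catch (err) {
    return null; // the URL parser rejected the host
  }
}

const idn = "info@b\u00FCcher.example";
const ascii = toAsciiAddress(idn);

console.log(html5Pattern.test(idn));   // Unicode domain as typed
console.log(ascii);                    // domain after conversion
console.log(html5Pattern.test(ascii)); // converted address
```

The Unicode form is rejected by the pattern, while the converted form (`info@xn--bcher-kva.example`) is accepted, which is the trade-off the paragraph above describes.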

UPDATE 2:

Updated HTML5 regex to match the updated spec, which changed label length limits from 255 characters to 63 characters, as specified in RFC 1034 section 3.5.
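The 63-character limit falls out of the label sub-pattern: one leading character, up to 61 middle characters, and one trailing character. A small sketch that isolates that sub-pattern (the name `html5Label` is an assumption for this example):

```javascript
// Label sub-pattern taken from the HTML5 regex, anchored on its own:
// 1 leading + {0,61} middle + 1 trailing = at most 63 characters.
const html5Label = /^[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?$/;

const maxLabel = "a".repeat(63);
const tooLong = "a".repeat(64);

console.log(maxLabel.length, html5Label.test(maxLabel));
console.log(tooLong.length, html5Label.test(tooLong));
```

A 63-character label is accepted and a 64-character label is rejected.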

Scott González answered Oct 16 '22 15:10


That doesn’t look right: what’s with the Unicode? Which RFC is this validating against?

See this answer for a proper RFC5322-validating regex.

tchrist answered Oct 16 '22 14:10