Can it cause harm to validate email addresses with a regex?

Tags: regex, validation, email

I've heard that it is a bad thing to validate email addresses with a regex, and that it actually can cause harm. Why is that?

I thought it never could be a bad thing to validate data. Maybe unnecessary, but never a bad thing provided that you perform the validation correctly. Why is this right or wrong? If it can cause harm, please give an example.

600

asked Jan 02 '18 03:01

klutt

4 Answers

In general, yes - using regular expressions to validate email addresses is harmful. This is because of bad (incorrect) assumptions by the author of the regular expression.

As klutt indicated, an email address has two parts, the local-part and the domain. It's worth noting some things about these parts that aren't immediately obvious:

The local-part can contain escaped characters and even additional @ characters.
The local-part can be case sensitive; however, it is up to the mail server at that specific domain how it wants to distinguish case.
The domain part can contain zero or more labels separated by a period (.), though in practice there are no MX records corresponding to the root (zero labels) or on the TLDs (one label) themselves.

So, there are some checks that you can do without rejecting valid email addresses that correspond with the above:

Address contains at least one @
The local-part (everything to the left of the rightmost @) is non-empty
The domain part (everything to the right of the rightmost @) contains at least one period (again, this isn't strictly true, but pragmatic)
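As a minimal sketch, the three checks above could look like this in Python. The function name `passes_basic_checks` is made up for illustration; nothing here validates against the RFCs, it only applies the lenient checks listed.

```python
def passes_basic_checks(address: str) -> bool:
    """Apply only the three lenient checks listed above."""
    # Check 1: the address contains at least one '@'.
    if "@" not in address:
        return False
    # Split on the rightmost '@', because the local-part may itself contain '@'.
    local_part, _, domain = address.rpartition("@")
    # Check 2: the local-part is non-empty.
    if not local_part:
        return False
    # Check 3: the domain contains at least one period (pragmatic, not strictly required).
    return "." in domain
```

An address that passes these checks is only plausible, not confirmed; deliverability still has to be tested.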

That's it. As others have pointed out, it's best practice to test deliverability to that address. This will establish two important things:

Whether the email address currently exists; and
That the user has access to the email address (is the legitimate user or owner)

If you build email activation processes into your business process, you don't need to worry about complicated regular expressions that have issues.
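A rough sketch of what such an activation step might look like, assuming an in-memory store and a placeholder URL. The names `pending`, `verified`, `start_verification` and `confirm`, and the `example.com` link, are all hypothetical; actually sending the mail and persisting the state are left out.

```python
import secrets

# Hypothetical in-memory state; a real system would persist this.
pending: dict[str, str] = {}   # token -> address awaiting confirmation
verified: set[str] = set()     # addresses whose owner clicked the link


def start_verification(address: str) -> str:
    """Create an unguessable token and return the link to mail to the user."""
    token = secrets.token_urlsafe(32)
    pending[token] = address
    # Placeholder URL, for illustration only.
    return f"https://example.com/verify?token={token}"


def confirm(token: str) -> bool:
    """Mark the address as verified if the token is known; tokens are single-use."""
    address = pending.pop(token, None)
    if address is None:
        return False
    verified.add(address)
    return True
```

Only an address that ends up in `verified` is known to exist and to be readable by the user, which is more than any regex can tell you.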

Some further reading for reference:

RFC 5321: Simple Mail Transfer Protocol

OWASP: Input Validation Cheat Sheet

154

answered Nov 15 '22 05:11

bly

TL;DR

Don't use regexes for validating emails unless you have a good reason to. Use a verification mail instead. In most cases, a regex that simply checks that the string contains an @ is enough.

Short version

Constructing regexes for validating emails can be a good and fun exercise, but in general, you should really avoid it in production code. The proper way of verifying an email address is in most cases to send a verification mail. Trying to verify if a mail address matches the specification is very tricky, and even if you get it right, it's still often useless information unless you know that it's a mail address that you can send mails to and that someone reads.

Think about it. How often do you have a use for storing a mail address that's wrong?

If you just want to make sure that a user does not mix up input fields, check that the mail address contains an @ character. That's enough. Well, it would not catch those who insist on using that character in user names or passwords, but that's their headache. ;)

Long version

In a majority of the cases where you would want to use this, just knowing that the email address is valid does not mean a thing. What you really want to know is if it is the right email address.

The reason may differ. You may want to send newsletters, use it for regular communication, password recovery or something else. But whatever it is, it's important that it is the right address. It's not important to know if the address fulfills a complicated standard. The only important thing is to know if it can be used for the purpose you have of storing the address.

The proper way to verify this is by sending a mail with a verification link.

If you have verified the email address with a verification link, there's often no point in checking if it is a correct email address, since you know it works. A check could however be used to make sure that the user is entering the email address in the correct field. My advice in this case is to be extremely forgiving. I'd say it's enough to just check that there is an @ in the field. It's a simple check and ALL email addresses include an @. If you want to make it more complicated than that, I would suggest just warning the user that there might be something wrong with the address, but not forbidding it. A pretty simple regex that would have extremely few false negatives (if any) is

.+@.+\..+

This means a non-empty string before the @, followed by a non-empty domain, a dot and a non-empty top-level domain. But actually, I'd just stick with @.+, which means that the right part is non-empty; I don't know of any DNS server that would accept an empty server name.
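One way to combine the two patterns above into "reject only the obvious, warn on the rest" is sketched below. The function name `check_field` and the three return values are made up for illustration.

```python
import re

LENIENT = re.compile(r"@.+")          # something follows an '@'
STRICTER = re.compile(r".+@.+\..+")   # local part, domain, dot, top-level domain


def check_field(value: str) -> str:
    """Return 'reject', 'warn' or 'ok' for a value typed into an email field."""
    # No '@' followed by anything: most likely the wrong field was filled in.
    if LENIENT.search(value) is None:
        return "reject"
    # Looks unusual, but may still be deliverable: warn, do not forbid.
    if STRICTER.fullmatch(value) is None:
        return "warn"
    return "ok"
```

With this sketch, `user@localhost` produces a warning rather than a rejection, so the user can still proceed and let the verification mail settle the question.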

Properly checking an email against the standard is actually really tricky

A bigger concern is that writing a regex that accurately verifies an email address is a very complex matter. If you try to create one on your own, you will almost certainly make mistakes. One thing worth mentioning here is that the standard, RFC 5322, allows comments within parentheses. To make things worse, nested comments are allowed. A regular expression in the strict sense cannot match arbitrarily nested patterns; you need an engine with extensions such as recursion for this. While such extended regexes are not unusual, it does say something about the complexity. And even if you get it right, will you update the regex when a new standard comes?
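To illustrate why nesting is a problem, here is a simplified sketch that strips parenthesised comments by counting depth, which is exactly the bookkeeping a plain regex cannot do. The function name `strip_comments` is made up, and the sketch deliberately ignores quoted strings and the rest of the RFC 5322 grammar.

```python
def strip_comments(text: str) -> str:
    """Remove parenthesised comments, including nested ones (simplified)."""
    out = []
    depth = 0  # how many comments we are currently inside
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # A backslash escapes the next character; keep the pair only outside comments.
            if depth == 0:
                out.append(text[i:i + 2])
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    if depth != 0:
        raise ValueError("unbalanced comment")
    return "".join(out)
```

For example, `strip_comments("john(the (real) one)@example.com")` yields `john@example.com`. Even this small helper needs a counter and escape handling, and it still covers only one corner of the standard.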

The mail server might support non-standard addresses

And one more thing: even if you get it 100% right, that still may not be enough. An email address has the local part on the left side of the @ and the domain part on the right. Everything in the local part is meant to be handled by the server. Sure, RFC 5322 is pretty detailed about what a valid local part looks like, but what if a particular email server accepts addresses that are not valid according to RFC 5322? Are you really sure you want to reject a particular email address that does work just because it does not follow the standard? Do you want to lose customers for your business just because they have chosen an obscure email provider?

If you really want to check if an address is correct in production code, then use .NET's MailAddress class or something equivalent. But first take a minute to ponder if this really is what you want. Ask yourself if the address has any value if it is not the correct address. If the answer is no, then you don't. Use verification links instead.

That being said, it can be a good thing to validate input. The important thing is to know why you are doing it. Validating the email with a regex or (preferably) something like the MailAddress class could give some protection against malicious input, such as SQL injection. But if this is the only method you have to protect yourself against malicious input, then you're doing something else very wrong.

answered Nov 15 '22 06:11

klutt

In addition to the other answers, I would like to point out that regex engines that use backtracking are susceptible to ReDoS (regex denial of service) attacks. The attack is based on the fact that many non-trivial regular expressions have inputs that take an extraordinary number of CPU cycles to produce a non-match.

With such a crafted input, even a small botnet might be enough to threaten the availability of the site.

Mitigations of the issue:

it is often possible to rewrite the regular expression to avoid catastrophic backtracking; or
use a regex engine that does not backtrack. While most engines backtrack, engines that do not also exist; a notable example is the RE2 engine, on which Go's regexp package is modeled.
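The first mitigation can be sketched with a textbook pair of patterns, using Python's backtracking `re` module. The patterns and the helper `seconds_to_reject` are illustrative assumptions, not taken from any real email validator.

```python
import re
import time

VULNERABLE = re.compile(r"^(a+)+$")  # nested quantifiers invite catastrophic backtracking
REWRITTEN = re.compile(r"^a+$")      # accepts the same strings without the nesting


def seconds_to_reject(pattern: re.Pattern, n: int) -> float:
    """Time a failing match against n 'a' characters followed by '!'."""
    subject = "a" * n + "!"
    start = time.perf_counter()
    pattern.match(subject)  # never matches, because of the trailing '!'
    return time.perf_counter() - start
```

Calling `seconds_to_reject(VULNERABLE, n)` with slowly increasing `n` shows the time growing roughly exponentially, so keep `n` small when experimenting; the rewritten pattern stays fast for the same inputs.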

For more information: "Regular Expressions Denial of the Service (ReDoS) Attacks"

answered Nov 15 '22 05:11

Mindaugas Bernatavičius

If your regular expression is ill-formed then you might deny valid email addresses. This goes for any "email validation" rule.

I know of an email address that is regularly rejected by forms even though it contains no email oddities; it's merely long. It really annoys the person it belongs to, because the part before the @ is their legal name - an obvious choice for an email address.

That is part of the potential harm of email validation done incorrectly: annoying users by denying valid email addresses from entering the system.

answered Nov 15 '22 04:11

Levi Morrison
