
Apache commons-validator alternative for new gTLDs

I need to validate emails and domains. I just need a formal validation, no whois or other forms of domain lookup needed.

Currently I'm using Apache's commons-validator v1.4.0.

Unfortunately my customers use the new gTLDs, like .bike or .productions that are not yet supported by the DomainValidator class. See Apache's Jira issue for more details.

Are there any sound alternatives that I may easily include in my Maven POM?

Iacopo asked Nov 25 '14 16:11


2 Answers

If you are not concerned about internationalized addresses, you can replace the TLD of the address with a known one before validating, and continue to use Apache Commons Validator.

This approach is based on the fact that whatever the TLD is, the validity of the whole domain name is equivalent to the validity of the same domain name with the TLD replaced with com. For example:

  • abc.def.com is valid. Similarly abc.def.name, abc.def.xn--kput3i, abc.def.uk are valid.
  • ab,de.com is not valid. Similarly ab,de.name, ab,de.xn--kput3i, ab,de.uk are not valid.

So instead of calling

return EmailValidator.getInstance().isValid(userEmail);

You can call

if ( userEmail == null ) {
    return false;
}
return EmailValidator.getInstance().isValid(userEmail.trim().replaceFirst("\\.\\p{Alpha}[\\p{Alnum}-]*\\p{Alnum}$", ".com"));

Explanation

  • The regular expression "\\.\\p{Alpha}[\\p{Alnum}-]*\\p{Alnum}$" matches the TLD part: it is at the end of the string (because of the $), it starts with a dot and contains no other dot, and it conforms to the standard: it begins with an ASCII letter, followed by zero or more alphanumerics or dashes, and ends with an alphanumeric character.
  • I am using trim() because EmailValidator on its own allows spaces before and after the address. Removing the spaces just makes it easier to replace the TLD, and it doesn't affect the validity of the address.
  • If the string doesn't have a valid TLD at the end, String.replaceFirst() returns it unchanged. It could still be valid, because email addresses of the form x@[n.n.n.n], where n.n.n.n is a valid IP address, are valid. So if no TLD is found, EmailValidator decides validity by itself.
  • Of course, if the TLD is not an IANA-recognized TLD, this validation will not tell you that. An e-mail like [email protected] will be accepted as legal, but IANA doesn't have that TLD yet.
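
The substitution step explained above can be tried on its own, without commons-validator on the classpath. The following is only a minimal sketch: the class and method names (TldNormalizer, replaceTld) are made up for illustration, and it shows the string that would be handed to EmailValidator, not the validation itself.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of commons-validator: it performs only the
// TLD substitution described above.
final class TldNormalizer {

    // A dot, then a label that starts with a letter, continues with letters,
    // digits or dashes, and ends with a letter or digit, anchored at the end.
    private static final Pattern TLD_AT_END =
            Pattern.compile("\\.\\p{Alpha}[\\p{Alnum}-]*\\p{Alnum}$");

    static String replaceTld(String address) {
        if (address == null) {
            return null;
        }
        return TLD_AT_END.matcher(address.trim()).replaceFirst(".com");
    }

    public static void main(String[] args) {
        System.out.println(replaceTld("user@example.bike"));          // TLD replaced
        System.out.println(replaceTld(" user@example.productions ")); // trimmed, TLD replaced
        System.out.println(replaceTld("user@[192.0.2.1]"));           // no TLD, left as is
        System.out.println(replaceTld("user@example.net-"));          // malformed TLD, left as is
    }
}
```

Inputs without a well-formed TLD come back unchanged, so the validator still gets to reject (or, for bracketed IP addresses, accept) them.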

Checking a domain is similar, without the trim() part:

if (userDomain == null ) {
   return false;
}
return DomainValidator.getInstance().isValid(userDomain.replaceFirst("\\.\\p{Alpha}[\\p{Alnum}-]*\\p{Alnum}$", ".com"));

I have also tried JavaMail's email address validation, but I don't really like it: it allows completely invalid domain names such as net-name.net- (ending with a dash) or IP addresses (which are not allowed for e-mail without square brackets around them), and it's only good for e-mail addresses, not for domains.

Internationalization

If you need to check internationalized domains and e-mail addresses, it's a bit different. Internationalized domains (for example 元気。テスト) are easy to check: convert them to ASCII with java.net.IDN.toASCII() (yielding xn--z4qx76d.xn--zckzah for my example domain, whose TLD is valid), and then do the same as I wrote above.
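
As a rough illustration of the conversion step, here is a small sketch using only java.net.IDN from the JDK. The class and method names are hypothetical, and the example domain is written with Unicode escapes so the source file stays ASCII. Returning null for input that cannot be converted is a choice made for this sketch.

```java
import java.net.IDN;

// Hypothetical wrapper around java.net.IDN for illustration only.
final class IdnDomainExample {

    // Returns the ASCII (Punycode) form of the domain, or null if the input
    // cannot be converted (IDN.toASCII throws IllegalArgumentException then).
    static String toAsciiDomain(String unicodeDomain) {
        try {
            return IDN.toASCII(unicodeDomain);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The example domain from the text; \u3002 is the ideographic full
        // stop, which IDN treats as a label separator.
        String domain = "\u5143\u6C17\u3002\u30C6\u30B9\u30C8";
        System.out.println(toAsciiDomain(domain));
        // Plain ASCII domains pass through unchanged.
        System.out.println(toAsciiDomain("example.com"));
    }
}
```

The ASCII result can then go through the TLD replacement and DomainValidator as for any other domain.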

Internationalized e-mails are a different story. If the local part is ASCII, you can convert the domain part to ASCII. If you have to display the email address, you need to use the Unicode version, and if you have to send an email message, you use the ASCII version.
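
A sketch of how the two forms might be derived when the local part is ASCII: split at the last @, convert only the domain, and keep both versions around. The names used here (EmailDomainForms, toAsciiForm, toDisplayForm) are hypothetical, and the code assumes the address contains an @ followed by a convertible domain; it does not validate anything.

```java
import java.net.IDN;

// Hypothetical helper: converts only the domain part of an address.
final class EmailDomainForms {

    // Form to use when sending: domain converted to ASCII (Punycode).
    // IDN.toASCII throws IllegalArgumentException for unconvertible domains.
    static String toAsciiForm(String email) {
        int at = email.lastIndexOf('@');
        if (at < 0) {
            return email; // no domain part, nothing to convert
        }
        return email.substring(0, at + 1) + IDN.toASCII(email.substring(at + 1));
    }

    // Form to use when displaying: domain converted back to Unicode.
    static String toDisplayForm(String email) {
        int at = email.lastIndexOf('@');
        if (at < 0) {
            return email;
        }
        return email.substring(0, at + 1) + IDN.toUnicode(email.substring(at + 1));
    }

    public static void main(String[] args) {
        // "info@" followed by a Japanese test domain, written with escapes.
        String email = "info@\u30C6\u30B9\u30C8.\u30C6\u30B9\u30C8";
        System.out.println(toAsciiForm(email));
        System.out.println(toDisplayForm(toAsciiForm(email)).equals(email));
    }
}
```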

More recently, a standard has been introduced for internationalized local parts as well, which also allows sending to the Unicode version of the domain name without translating it to ASCII first. Whether to support it requires some thought, as not many mail servers and mail transfer agents support it at the moment.

RealSkeptic answered Oct 22 '22 11:10


I copied the implementation from DomainValidator and replaced the TOP_LABEL_REGEX expression with "\\p{Alpha}[\\p{Alnum}-]*\\p{Alpha}".

In addition, I removed the validation against the hard-coded list of approved gTLDs. This is admittedly weak, since it no longer validates against the actual list of TLDs. But I think it's good enough (it catches gTLDs similar to XN--YGBI2AMMX).

See the full list of approved TLDs at http://data.iana.org/TLD/tlds-alpha-by-domain.txt.

// Copied from org.apache.commons.validator.routines.DomainValidator
private static final String DOMAIN_LABEL_REGEX = "\\p{Alnum}(?>[\\p{Alnum}-]*\\p{Alnum})*";
// Changed to include new gTLD - http://data.iana.org/TLD/tlds-alpha-by-domain.txt
private static final String TOP_LABEL_REGEX = "\\p{Alpha}[\\p{Alnum}-]*\\p{Alpha}";
// Copied from org.apache.commons.validator.routines.DomainValidator
private static final String DOMAIN_NAME_REGEX = "^(?:" + DOMAIN_LABEL_REGEX + "\\.)+" + "(" + TOP_LABEL_REGEX + ")$";
private static final RegexValidator domainRegex = new RegexValidator(DOMAIN_NAME_REGEX);

// EmailValidator has no public no-arg constructor, so use the shared instance
private static final EmailValidator EMAIL_VALIDATOR = EmailValidator.getInstance();

public static boolean isValidDomain(String domain) {
    String[] groups = domainRegex.match(domain);
    return groups != null && groups.length > 0;
}
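
To experiment with the relaxed pattern without commons-validator, the same regular expressions can be compiled with java.util.regex. This sketch swaps RegexValidator for Pattern, so it is not the code above, only the same expressions. The class name is made up for illustration.

```java
import java.util.regex.Pattern;

// JDK-only stand-in for the snippet above; the class name is hypothetical.
final class RelaxedDomainRegexDemo {

    private static final String DOMAIN_LABEL_REGEX =
            "\\p{Alnum}(?>[\\p{Alnum}-]*\\p{Alnum})*";
    // Relaxed top label: a letter, then letters, digits or dashes, then a letter.
    private static final String TOP_LABEL_REGEX =
            "\\p{Alpha}[\\p{Alnum}-]*\\p{Alpha}";
    private static final Pattern DOMAIN_NAME = Pattern.compile(
            "^(?:" + DOMAIN_LABEL_REGEX + "\\.)+(" + TOP_LABEL_REGEX + ")$");

    static boolean isValidDomain(String domain) {
        return domain != null && DOMAIN_NAME.matcher(domain).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidDomain("example.bike"));          // new gTLD
        System.out.println(isValidDomain("example.xn--ygbi2ammx")); // Punycode TLD
        System.out.println(isValidDomain("example.net-"));          // trailing dash
        System.out.println(isValidDomain("exa_mple.com"));          // underscore in label
    }
}
```

Note that this pattern requires at least one dot and a top label of at least two characters, and, like the snippet above, it accepts any well-formed TLD whether or not it is registered.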
AlikElzin-kilaka answered Oct 22 '22 12:10