 

Library for canonicalizing (normalizing but NOT just cleansing) email addresses

There are multiple ways to produce email address strings that differ under plain string comparison (see below) but are logically equivalent (i.e. mail sent to both goes to the same mailbox). This often lets users supply seemingly unique email addresses even when exact duplicates are disallowed.

I was hoping to find a library that would do this kind of normalization, to help find duplicates in large sets of email addresses. The goal is to find as many duplicates as possible. Given how useful this is for multiple purposes (in my case simple abuse detection, as abusers tend to (try to) reuse the same accounts), I am thinking there might be existing solutions.

So what kind of things can vary? I know of at least things like:

  • the domain name part is case-insensitive (as per DNS), but the local part may or may not be; this depends on the mail provider (for example, Gmail considers it case-insensitive)
  • many domains have aliases (googlemail.com is equivalent to gmail.com)
  • some email providers ignore other variations (Gmail, for example, ignores any dots in the local part of the address!)

Ideally this would be in Java, although scripting languages would also work (command-line tool)

StaxMan asked Jun 30 '11 23:06



1 Answer

I could find a few bits of code on Google by searching for "normalize email address", but nothing nearly thorough enough. I'm afraid you would have to write your own tool. If I were to write such a tool, here are a few rules I think I would apply:

First, the tool would lower-case the domain name (after the @). This shouldn't be too hard, unless you want to handle emails with internationalized domain names. For example, the domain of JoE@caFÉ.fR (note the accent on the E) should first go through the Nameprep algorithm and then be Punycode-encoded. This leads to JoE@xn--caf-dma.fr. I have never seen anyone with such an international email address, but I suspect you might find some in China or Japan, for example.
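As a rough illustration of this first step, here is a minimal Python sketch. The function name normalize_domain is made up for this example, and it relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules (Nameprep plus Punycode); newer IDNA revisions may treat some characters differently.

```python
def normalize_domain(address: str) -> str:
    """Lower-case and IDNA-encode the domain part (hypothetical helper)."""
    # Split on the last "@": a quoted local part may itself contain "@".
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError(f"no @ in {address!r}")
    # The "idna" codec applies Nameprep and Punycode to non-ASCII labels.
    # Pure-ASCII labels pass through unchanged, hence the explicit lower().
    ascii_domain = domain.encode("idna").decode("ascii").lower()
    return f"{local}@{ascii_domain}"


print(normalize_domain("JoE@caFÉ.fR"))  # JoE@xn--caf-dma.fr
```

The local part is deliberately left untouched here, since its case handling depends on the provider.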

RFC 5321 states that the local-part of the email (before the @) is case sensitive, but the de facto standard for virtually all providers is to ignore case (I have never seen a case-sensitive email address actually used by a human being, but I suppose there are still some sysadmins out there who use their Un*x email accounts, where case does matter). I think the tool should have an option to ignore case for a list of domain names (or on the contrary, to be case sensitive only for a list of domain names). So at this point, the email address JoE@caFÉ.fR is now normalized to joe@xn--caf-dma.fr.
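The "case sensitive only for a list of domain names" option could look something like the sketch below. The set CASE_SENSITIVE_DOMAINS and its single entry are placeholders invented for the example, not real provider data, and the function assumes the domain is already in ASCII form.

```python
# Hypothetical list of domains whose local parts must keep their case.
CASE_SENSITIVE_DOMAINS = frozenset({"unix.example.org"})


def fold_local_case(address: str, case_sensitive=CASE_SENSITIVE_DOMAINS) -> str:
    """Lower-case the local part unless the domain is listed as case-sensitive."""
    local, _, domain = address.rpartition("@")
    domain = domain.lower()
    if domain not in case_sensitive:
        local = local.lower()
    return f"{local}@{domain}"


print(fold_local_case("JoE@xn--caf-dma.fr"))     # joe@xn--caf-dma.fr
print(fold_local_case("Root@unix.example.org"))  # Root@unix.example.org
```

Inverting the logic (a list of domains known to ignore case, with everything else left alone) is the more conservative choice if false duplicates are costly.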

Once again, the question of international (i.e. non-ASCII) email addresses pops up. What if the local-part is non-ASCII? For example something like 甲斐@黒川.日本 (disclaimer: I don't speak Japanese). RFC 5322 forbids this, but more recent RFCs (the RFC 6530 series) do support this (see this wikipedia article). A lot of languages have no notion of lower or upper case. When they do, and you want to change to the lower-case form, make sure to use the appropriate Unicode algorithms, because it's not always trivial. For example, in German, simple lower-casing of the word "Großes" gives "großes", whereas full Unicode case folding gives "grosses" (disclaimer: I don't speak German either). So at this point, the email address "Großes@caFÉ.Fr" should have been normalized to either "großes@xn--caf-dma.fr" or "grosses@xn--caf-dma.fr", depending on which rule you choose.
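To make the difference concrete, here is a small Python sketch. The helper name fold_unicode is invented for this example, and combining NFKC normalization with str.casefold() is just one reasonable choice, not something any RFC mandates for local parts.

```python
import unicodedata


def fold_unicode(text: str) -> str:
    """Aggressive, comparison-only folding of a non-ASCII local part (sketch)."""
    # NFKC makes compatibility forms (e.g. full-width letters) compare equal.
    # casefold() goes further than lower(): it maps "ß" to "ss".
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Case folding can occasionally denormalize the string, so normalize again.
    return unicodedata.normalize("NFKC", folded)


print("Großes".lower())        # großes
print(fold_unicode("Großes"))  # grosses
print(fold_unicode("甲斐"))     # 甲斐 (no case to fold)
```

The folded form is only meant as a key for comparing addresses; mail should still be sent to the address as the user typed it.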

I haven't read RFC 5322 in detail, but I think it is also possible to have comments in an email address, either at the beginning or at the end of the local part, such as (sir)john.lennon@beatles.com or john.lennon(ono)@beatles.com. These comments should be stripped (this would lead to john.lennon@beatles.com). Stripping the comments is not entirely trivial, because I don't know what to do with nested comments, and also comments enclosed in double quotes should not be stripped, according to the RFC (unless I am mistaken). For example, the comment in the following email address should not be stripped, according to the RFC: "john.(ono).lennon"@beatles.com.
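One way to handle both nesting and quoted strings is a small character scanner, sketched below in Python. The name strip_comments is made up, the function works on the local part only, and it makes the assumption that nested comments are simply dropped along with their parent; it is not a full RFC 5322 parser.

```python
def strip_comments(local: str) -> str:
    """Drop parenthesised comments from a local part, keeping quoted text."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # True while inside a "quoted string"
    i = 0
    while i < len(local):
        ch = local[i]
        if ch == "\\" and i + 1 < len(local) and (in_quotes or depth):
            # A backslash escapes the next character (quoted-pair).
            if depth == 0:
                out.append(local[i:i + 2])
            i += 2
            continue
        if in_quotes:
            out.append(ch)
            if ch == '"':
                in_quotes = False
        elif ch == '"' and depth == 0:
            in_quotes = True
            out.append(ch)
        elif ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    return "".join(out)


print(strip_comments("(sir)john.lennon"))       # john.lennon
print(strip_comments("john.lennon(ono)"))       # john.lennon
print(strip_comments('"john.(ono).lennon"'))    # "john.(ono).lennon"
```

Parentheses inside double quotes are left alone, which matches the quoted example above.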

Once the email is thus normalized, I would apply the "provider-specific" rules you suggest. For example stripping the dots in GMail addresses and mixing equivalent domain names (googlemail.com == gmail.com for example). I think I would keep this really separate from the previous normalization steps.

Note that Gmail also ignores the plus sign (+) and everything after it in the local part; for example, john.lennon+ono@gmail.com is equivalent to john.lennon@gmail.com.
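Keeping the provider-specific rules in a separate, data-driven step might look like the Python sketch below. The three tables and the function name apply_provider_rules are assumptions made for illustration; they only encode the Gmail behaviour mentioned here, and the input is assumed to be already lower-cased by the earlier steps.

```python
# Hypothetical rule tables; provider behaviour can change and must be verified.
DOMAIN_ALIASES = {"googlemail.com": "gmail.com"}
STRIP_DOTS = {"gmail.com"}       # providers that ignore dots in the local part
STRIP_PLUS_TAG = {"gmail.com"}   # providers that ignore "+tag" suffixes


def apply_provider_rules(address: str) -> str:
    """Apply provider-specific equivalences to an already-normalized address."""
    local, _, domain = address.rpartition("@")
    domain = DOMAIN_ALIASES.get(domain, domain)
    if domain in STRIP_PLUS_TAG:
        local = local.split("+", 1)[0]
    if domain in STRIP_DOTS:
        local = local.replace(".", "")
    return f"{local}@{domain}"


print(apply_provider_rules("john.lennon+ono@googlemail.com"))  # johnlennon@gmail.com
print(apply_provider_rules("john.lennon@beatles.com"))         # john.lennon@beatles.com
```

Because the rules live in plain tables, they can be updated without touching the generic normalization code.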

I'm not aware of other provider rules. The thing is, these rules may change at any time, so you would have to keep track of them all.

I think that's about it. If you come up with some working code, I would be really interested to see it.

Cheers!

MiniQuark answered Oct 19 '22 22:10
