Best way to detect similar email addresses?

Tags: c#, levenshtein-distance

I have a list of ~20,000 email addresses, some of which I know to be fraudulent attempts to get around a "1 per e-mail" limit, such as username1@gmail.com, username1a@gmail.com, username1b@gmail.com, etc. I want to find similar email addresses for evaluation. Currently I'm using a Levenshtein algorithm to check each e-mail against the others in the list and report any with an edit distance of 2 or less. However, this is painfully slow. Is there a more efficient approach?

The test code I'm using now is:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Threading;

namespace LevenshteinAnalyzer
{
    class Program
    {
        const string INPUT_FILE = @"C:\Input.txt";
        const string OUTPUT_FILE = @"C:\Output.txt";

        static void Main(string[] args)
        {
            var inputWords = File.ReadAllLines(INPUT_FILE);
            var outputWords = new SortedSet<string>();

            for (var i = 0; i < inputWords.Length; i++)
            {
                if (i % 100 == 0) 
                    Console.WriteLine("Processing record #" + i);

                var word1 = inputWords[i].ToLower();
                for (var n = i + 1; n < inputWords.Length; n++)
                {
                    // n starts at i + 1, so an entry is never compared with itself.
                    var word2 = inputWords[n].ToLower();

                    // Skip exact duplicates and addresses that are already flagged.
                    if (word1 == word2) continue;
                    if (outputWords.Contains(word1)) continue;
                    if (outputWords.Contains(word2)) continue;
                    var distance = LevenshteinAlgorithm.Compute(word1, word2);

                    if (distance <= 2)
                    {
                        outputWords.Add(word1);
                        outputWords.Add(word2);
                    }
                }
            }

            File.WriteAllLines(OUTPUT_FILE, outputWords.ToArray());
            Console.WriteLine("Found {0} words", outputWords.Count);
        }
    }
}

Edit: Some of the stuff I'm trying to catch looks like:

01234567890@gmail.com
0123456789@gmail.com
012345678@gmail.com
01234567@gmail.com
0123456@gmail.com
012345@gmail.com
01234@gmail.com
0123@gmail.com
012@gmail.com

asked May 11 '10 16:05

Chris

1 Answer

You could start by prioritizing which emails get compared to one another.

The main reason this is slow is that comparing each address to every other address is O(n²): with ~20,000 addresses that is up to roughly 200 million Levenshtein computations. Cutting down the number of pairs you compare is the key to improving the performance of this kind of search.

For instance, you could bucket all emails that have a similar length (+/- some amount) and compare within each bucket first. You could also strip all special characters (numbers, symbols) from emails and find those that are identical after that reduction.
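The two ideas above might be sketched roughly as follows. This is only an illustration: `EmailBuckets`, `NormalizedKey`, `SuspiciousGroups` and `CandidatePairsByLength` are hypothetical names, and the choice to strip only the part before the `@` is an assumption. The length pruning relies on the fact that the edit distance between two strings is never smaller than the difference in their lengths.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class EmailBuckets
{
    // Hypothetical helper: lower-case the address and drop every non-letter
    // from the local part, so "User.Name1@Gmail.com" becomes "username@gmail.com".
    public static string NormalizedKey(string email)
    {
        var lower = email.Trim().ToLowerInvariant();
        var at = lower.LastIndexOf('@');
        if (at < 0) return lower; // not a well-formed address; leave it alone

        var letters = lower.Substring(0, at).Where(c => char.IsLetter(c)).ToArray();
        return new string(letters) + lower.Substring(at);
    }

    // Addresses that collapse to the same key end up in the same group.
    // Only groups with more than one distinct address are returned.
    public static List<List<string>> SuspiciousGroups(IEnumerable<string> emails)
    {
        return emails
            .Select(e => e.Trim().ToLowerInvariant())
            .Distinct()
            .GroupBy(NormalizedKey)
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList())
            .ToList();
    }

    // Length pruning: after sorting by length, a pair whose lengths differ by
    // more than maxDistance cannot be within maxDistance edits, so it is skipped.
    public static IEnumerable<(string Shorter, string Longer)> CandidatePairsByLength(
        IEnumerable<string> emails, int maxDistance)
    {
        var sorted = emails
            .Select(e => e.Trim().ToLowerInvariant())
            .Distinct()
            .OrderBy(e => e.Length)
            .ToArray();

        for (var i = 0; i < sorted.Length; i++)
        {
            for (var j = i + 1;
                 j < sorted.Length && sorted[j].Length - sorted[i].Length <= maxDistance;
                 j++)
            {
                yield return (sorted[i], sorted[j]);
            }
        }
    }
}
```

The pairs from `CandidatePairsByLength` would still go through the Levenshtein check; the pruning only avoids computing it for pairs that cannot match. Note that the key-based grouping is coarse: every all-numeric address at the same domain (like the 0123...@gmail.com examples) collapses to the same key, and a trailing letter such as the "a" in username1a survives the stripping, so those cases still need the edit-distance pass.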

You may also want to create a trie from the data rather than processing it line by line, and use that to find all emails that share common prefixes or suffixes, then drive your comparison logic from that reduced set. From the examples you provided, it looks like you are looking for addresses where part of one address appears as a substring of another. Tries (and suffix trees) are efficient data structures for performing these types of searches.
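A minimal sketch of the trie idea, assuming that the interesting case is one local part being a prefix of another at the same domain (as in the 012@gmail.com, 0123@gmail.com examples). `TrieNode`, `PrefixTrie`, `Build` and `PrefixPairs` are hypothetical names, and keying the trie on "domain@local" is just one way to keep different domains on separate branches.

```csharp
using System;
using System.Collections.Generic;

sealed class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public string Email; // non-null when an address ends exactly at this node
}

static class PrefixTrie
{
    // Inserts each address under the key "domain@local", so addresses at the
    // same domain share a path and their local parts branch below it.
    public static TrieNode Build(IEnumerable<string> emails)
    {
        var root = new TrieNode();
        foreach (var raw in emails)
        {
            var email = raw.Trim().ToLowerInvariant();
            var at = email.LastIndexOf('@');
            if (at < 0) continue; // skip malformed lines

            var key = email.Substring(at + 1) + "@" + email.Substring(0, at);
            var node = root;
            foreach (var c in key)
            {
                TrieNode child;
                if (!node.Children.TryGetValue(c, out child))
                {
                    child = new TrieNode();
                    node.Children[c] = child;
                }
                node = child;
            }
            node.Email = email;
        }
        return root;
    }

    // Returns pairs where the longer local part is the shorter one plus
    // between 1 and maxExtra trailing characters.
    public static List<(string Shorter, string Longer)> PrefixPairs(TrieNode root, int maxExtra)
    {
        var result = new List<(string Shorter, string Longer)>();
        Visit(root, maxExtra, result);
        return result;
    }

    static void Visit(TrieNode node, int maxExtra, List<(string Shorter, string Longer)> result)
    {
        if (node.Email != null && maxExtra > 0)
            foreach (var child in node.Children.Values)
                Collect(child, node.Email, maxExtra - 1, result);

        foreach (var child in node.Children.Values)
            Visit(child, maxExtra, result);
    }

    // Walks at most `remaining` further levels below `node`, recording every
    // address found on the way as an extension of `shorter`.
    static void Collect(TrieNode node, string shorter, int remaining,
                        List<(string Shorter, string Longer)> result)
    {
        if (node.Email != null) result.Add((shorter, node.Email));
        if (remaining <= 0) return;
        foreach (var child in node.Children.Values)
            Collect(child, shorter, remaining - 1, result);
    }
}
```

Building the trie is linear in the total length of the input, and each address is only compared with the addresses in its own small subtree. This only catches trailing additions; edits in the middle of the local part would still need an edit-distance check on a reduced candidate set.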

Another possible way to optimize this would be to use the date each account was created (assuming you know it). Duplicate accounts are likely to be created within a short time of one another, which may help you reduce the number of comparisons to perform when looking for duplicates.
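If creation dates are available, the comparison could be limited to a sliding time window, along these lines. The `Signup` type, its `CreatedUtc` field and the `TimeWindow.CandidatePairs` method are assumptions for illustration; the question itself only has a flat list of addresses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record shape: an address plus the time the account was created.
sealed class Signup
{
    public string Email;
    public DateTime CreatedUtc;
}

static class TimeWindow
{
    // Sorts by creation time and pairs each signup only with later signups
    // created within `window` of it. Everything else is never compared.
    public static IEnumerable<(Signup First, Signup Second)> CandidatePairs(
        IEnumerable<Signup> signups, TimeSpan window)
    {
        var sorted = signups.OrderBy(s => s.CreatedUtc).ToArray();
        for (var i = 0; i < sorted.Length; i++)
        {
            for (var j = i + 1;
                 j < sorted.Length && sorted[j].CreatedUtc - sorted[i].CreatedUtc <= window;
                 j++)
            {
                yield return (sorted[i], sorted[j]);
            }
        }
    }
}
```

Each candidate pair would then be passed to the existing Levenshtein check. The window size is a tuning knob: a wider window catches slower abusers at the cost of more comparisons.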

answered Sep 19 '22 09:09

LBushkin
