Detecting "near duplicates" using a LINQ/C# query

I'm using the following query to detect duplicates in a database.

Matching on the exact name doesn't work very well, because Company X may also be listed as CompanyX. I'd like to amend the query to detect "near duplicates".

var results = result
    .GroupBy(c => new { c.CompanyName })
    .Select(g => new CompanyGridViewModel
    {
        LeadId = g.First().LeadId,
        Qty = g.Count(),
        CompanyName = g.Key.CompanyName
    })
    .ToList();

Could anybody suggest a way to get better control over the comparison? Perhaps via an IEqualityComparer (although I'm not exactly sure how that would work in this situation).

My main goals are:

  1. To list the first record with a subset of all duplicates (or "near duplicates")
  2. To have some flexibility over the fields and text comparisons I use for my duplicates.

Nick asked Dec 20 '25 16:12


1 Answer

For your explicit "ignoring spaces" case, you can simply call

var results = result.GroupBy(c => c.Name.Replace(" ", ""))...
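If spacing is not the only difference between near duplicates (case or punctuation may differ too), the same idea extends to a normalized grouping key. A minimal sketch, assuming a hypothetical helper NameKey.Normalize that is not part of the original answer, and assuming the grouping runs in memory (LINQ to Objects), since a database query provider may not be able to translate a custom method:

```csharp
using System;
using System.Linq;

public static class NameKey
{
    // Hypothetical helper: keeps only letters and digits and upper-cases them,
    // so names that differ in spacing, case or punctuation share one key.
    public static string Normalize(string name)
    {
        if (name == null) return "";
        char[] kept = name.Where(ch => char.IsLetterOrDigit(ch)).ToArray();
        return new string(kept).ToUpperInvariant();
    }
}
```

It would be used as result.GroupBy(c => NameKey.Normalize(c.Name)), which puts "Company X", "CompanyX" and "company-x" into the same group.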

However, in the general case where you want flexibility, I'd build up a library of IEqualityComparer<Company> classes to use in your groupings. For example, this should do the same in your "ignore space" case:

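using System.Collections.Generic;

// Treats two companies as equal when their names match once spaces are removed.
public class CompanyNameIgnoringSpaces : IEqualityComparer<Company>
{
    public bool Equals(Company x, Company y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return Normalize(x.Name) == Normalize(y.Name);
    }

    // Must hash the same normalized value that Equals compares.
    public int GetHashCode(Company obj)
    {
        return Normalize(obj.Name).GetHashCode();
    }

    private static string Normalize(string name)
    {
        return (name ?? "").Replace(" ", "");
    }
}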

which you could use as

var results = result.GroupBy(c => c, new CompanyNameIgnoringSpaces())...
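To show how the comparer fits the question's goal (first record plus a count per group), here is a self-contained sketch. The Company class with LeadId and Name is an assumed shape, DuplicateReport.Build is a hypothetical helper, and the Count() > 1 filter is an addition that is not in the original query. It also assumes the data is already in memory, because the comparer overload of GroupBy is evaluated by LINQ to Objects rather than by the database.

```csharp
using System.Collections.Generic;
using System.Linq;

// Assumed shape of the entity; only the members used here are shown.
public class Company
{
    public int LeadId { get; set; }
    public string Name { get; set; }
}

// The space-ignoring comparer, repeated so the sketch compiles on its own.
public class CompanyNameIgnoringSpaces : IEqualityComparer<Company>
{
    public bool Equals(Company x, Company y)
    {
        return x.Name.Replace(" ", "") == y.Name.Replace(" ", "");
    }

    public int GetHashCode(Company obj)
    {
        return obj.Name.Replace(" ", "").GetHashCode();
    }
}

public static class DuplicateReport
{
    // Hypothetical helper: one row per group of near duplicates,
    // carrying the first record's LeadId and Name plus the group size.
    public static List<(int LeadId, string Name, int Qty)> Build(IEnumerable<Company> companies)
    {
        return companies
            .GroupBy(c => c, new CompanyNameIgnoringSpaces())
            .Where(g => g.Count() > 1) // keep only names that occur more than once
            .Select(g => (LeadId: g.First().LeadId, Name: g.First().Name, Qty: g.Count()))
            .ToList();
    }
}
```

Given records (1, "Company X"), (2, "CompanyX") and (3, "Acme"), Build returns a single row with LeadId 1, Name "Company X" and Qty 2.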

It's pretty straightforward to do similar things that compare multiple fields, use other definitions of similarity, etc.
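As an illustration of a multi-field comparer, the sketch below treats two companies as duplicates when both name and postcode match after removing spaces and ignoring case. The Postcode property and the class name CompanyNameAndPostcodeComparer are hypothetical and only serve the example.

```csharp
using System;
using System.Collections.Generic;

// Assumed entity shape for this example.
public class Company
{
    public string Name { get; set; }
    public string Postcode { get; set; } // hypothetical second field
}

public class CompanyNameAndPostcodeComparer : IEqualityComparer<Company>
{
    // Builds one normalized key from both fields. The newline separator keeps
    // ("ab", "c") and ("a", "bc") from producing the same key.
    private static string Key(Company c)
    {
        string name = (c.Name ?? "").Replace(" ", "").ToUpperInvariant();
        string postcode = (c.Postcode ?? "").Replace(" ", "").ToUpperInvariant();
        return name + "\n" + postcode;
    }

    public bool Equals(Company x, Company y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return Key(x) == Key(y);
    }

    public int GetHashCode(Company obj)
    {
        return Key(obj).GetHashCode();
    }
}
```

Deriving both Equals and GetHashCode from the same key string is a deliberate choice here: string equality is already reflexive, symmetric and transitive, so the comparer inherits those properties, and equal companies always get equal hash codes.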

Just note that your definition of "similar" must be transitive. For example, if you're looking at integers you can't define "similar" as "within 5", because then you'd have "0 is similar to 5" and "5 is similar to 10" but not "0 is similar to 10". (It must also be reflexive and symmetric, but that's more straightforward.)
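One way to catch a non-transitive rule early is to test a comparer against sample data. The sketch below is only an illustration: WithinFive encodes the "within 5" rule from the paragraph above, and ComparerChecks.IsTransitiveOn is a hypothetical brute-force helper, suitable for small samples only.

```csharp
using System;
using System.Collections.Generic;

// The non-transitive "within 5" rule, for illustration only.
public class WithinFive : IEqualityComparer<int>
{
    public bool Equals(int x, int y)
    {
        return Math.Abs(x - y) <= 5;
    }

    // Equal values must share a hash code. Chaining 0~5, 5~10, ... would force
    // every int to share one, so a constant is the only consistent choice.
    public int GetHashCode(int obj)
    {
        return 0;
    }
}

public static class ComparerChecks
{
    // Returns true if no triple (a, b, c) in the sample violates transitivity.
    public static bool IsTransitiveOn<T>(IEqualityComparer<T> comparer, IEnumerable<T> sample)
    {
        var items = new List<T>(sample);
        foreach (T a in items)
            foreach (T b in items)
                foreach (T c in items)
                    if (comparer.Equals(a, b) && comparer.Equals(b, c) && !comparer.Equals(a, c))
                        return false;
        return true;
    }
}
```

ComparerChecks.IsTransitiveOn(new WithinFive(), new[] { 0, 5, 10 }) returns false, while the same call with EqualityComparer<int>.Default returns true. A passing check on a sample does not prove transitivity in general; it only fails to find a counterexample.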

Rawling answered Dec 23 '25 06:12


