Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C# - fastest way of comparing a collection against itself to find duplicates

public class TestObject
{
    string TestValue { get; set; }
    bool IsDuplicate { get; set; }
}

List<TestObject> testList = new List<TestObject>
{
    new TestObject { TestValue = "Matt" },
    new TestObject { TestValue = "Bob" },
    new TestObject { TestValue = "Alice" },
    new TestObject { TestValue = "Matt" },
    new TestObject { TestValue = "Claire" },
    new TestObject { TestValue = "Matt" }
};

Imagine testList is actually millions of objects long.

What's the fastest way to ensure that two of those three TestObjects with TestValue of Matt gets its IsDuplicate set to true? No matter how may instances of a given value there are, only one should come out of the process with IsDuplicate of false.

I am not averse to doing this via threading. And the collection doesn't have to be a list if converting it to another collection type is faster.

I need to keep duplicates and mark them as such, not remove them from the collection.

To expand, this is (as you might imagine) a simple expression of a much more complex problem. The objects in question already have an ordinal which I can use to order them.

After matching initial duplicates on exact string equality, I'm going to have to go back through the collection again and re-try the remainder using some fuzzy matching logic. The collection that exists at the start of this process won't be changed during the deduplication, or afterwards.

Eventually the original collection is going to be written out to a file, with likely duplicates flagged.

like image 748
Bob Tway Avatar asked Dec 02 '22 14:12

Bob Tway


2 Answers

As others mentioned, the correct approach here would be to use the HashSet class.

var hashSet = new HashSet<string>();

foreach (var obj in testList)
{
    if (!hashSet.Add(obj.TestValue))
    {
        obj.IsDuplicate = true;
    }
}

When you add a value first time to the HashSet, it adds successfully and HashSet.Add() method returns true so you don't make any changes to the item. When you're trying to add it second time, HashSet.Add() returns false and you mark your item as a duplicate.

The list will have the following state after finishing running our marking duplicates method:

Matt
Bob
Alice
Claire
Matt DUPLICATE
like image 83
Ivan Yurchenko Avatar answered Apr 29 '23 16:04

Ivan Yurchenko


This is probably quite performant:

foreach (var dupe in testList.GroupBy(x => x.TestValue).SelectMany(g => g.Skip(1)))
    dupe.IsDuplicate = true;

[EDIT] This method turns out to be about a third of the speed of the accepted answer above, so that one should be used. This answer is merely of academic interest.

like image 32
Matthew Watson Avatar answered Apr 29 '23 16:04

Matthew Watson