public class TestObject
{
string TestValue { get; set; }
bool IsDuplicate { get; set; }
}
List<TestObject> testList = new List<TestObject>
{
new TestObject { TestValue = "Matt" },
new TestObject { TestValue = "Bob" },
new TestObject { TestValue = "Alice" },
new TestObject { TestValue = "Matt" },
new TestObject { TestValue = "Claire" },
new TestObject { TestValue = "Matt" }
};
Imagine testList
is actually millions of objects long.
What's the fastest way to ensure that two of those three TestObjects
with TestValue
of Matt gets its IsDuplicate
set to true? No matter how may instances of a given value there are, only one should come out of the process with IsDuplicate of false.
I am not averse to doing this via threading. And the collection doesn't have to be a list if converting it to another collection type is faster.
I need to keep duplicates and mark them as such, not remove them from the collection.
To expand, this is (as you might imagine) a simple expression of a much more complex problem. The objects in question already have an ordinal which I can use to order them.
After matching initial duplicates on exact string equality, I'm going to have to go back through the collection again and re-try the remainder using some fuzzy matching logic. The collection that exists at the start of this process won't be changed during the deduplication, or afterwards.
Eventually the original collection is going to be written out to a file, with likely duplicates flagged.
As others mentioned, the correct approach here would be to use the HashSet class.
var hashSet = new HashSet<string>();
foreach (var obj in testList)
{
if (!hashSet.Add(obj.TestValue))
{
obj.IsDuplicate = true;
}
}
When you add a value first time to the HashSet, it adds successfully and HashSet.Add()
method returns true so you don't make any changes to the item. When you're trying to add it second time, HashSet.Add()
returns false and you mark your item as a duplicate.
The list will have the following state after finishing running our marking duplicates method:
Matt
Bob
Alice
Claire
Matt DUPLICATE
This is probably quite performant:
foreach (var dupe in testList.GroupBy(x => x.TestValue).SelectMany(g => g.Skip(1)))
dupe.IsDuplicate = true;
[EDIT] This method turns out to be about a third of the speed of the accepted answer above, so that one should be used. This answer is merely of academic interest.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With