
How to find duplicates in a List<T> quickly, and update the original collection

Let me start by saying I've read these questions: 1 & 2, and I understand that I can write the code to find duplicates in my List. My problem is that I want to update the original list, not just query and print the duplicates.

I know I can't update the collection the query returns as it's not a view, it's an anonymous type IEnumerable<T>.

I want to be able to find duplicates in my list, and mark a property I've created called State which is used later in the application.

Has anyone run into this problem, and can you point me in the right direction?

p.s. The approach I'm using ATM is a bubble sort type loop to go through the list item by item and compare key fields. Obviously this isn't the fastest method.

EDIT:

For an item in the list to be considered a "duplicate", three fields must match. We'll call them Field1, Field2, and Field3.

I have an overloaded Equals() method on the base class which compares these fields.

The only time I skip an object in my MarkDuplicates() method is if the object's state is UNKNOWN or ERROR; otherwise, I test it.

Let me know if you need more details.

Thanks again!

Chris asked Apr 27 '09 05:04



2 Answers

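I think the easiest way is to start by writing an extension method which finds duplicates in a list of objects. Since your objects use .Equals(), they can be compared in most common collections. Note that HashSet<T> also relies on GetHashCode(), so it needs to be overridden consistently with Equals() (equal objects must return equal hash codes).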

public static IEnumerable<T> FindDuplicates<T>(this IEnumerable<T> enumerable) {
  var hashset = new HashSet<T>();
  foreach ( var cur in enumerable ) { 
    if ( !hashset.Add(cur) ) {
      yield return cur;
    }
  }
}

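Now it should be pretty easy to update your collection for duplicates. Because the elements are reference types, setting a property on an item returned by the query changes the same object held in the original list. For instance: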

List<SomeType> list = GetTheList();
list
  .FindDuplicates()
  .ToList()
  .ForEach(x => x.State = "DUPLICATE");

If you already have a ForEach extension method for IEnumerable<T> defined in your code, you can avoid the .ToList() call.
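As a rough sketch of how the pieces above could fit together, the following shows a hypothetical Item class whose Equals() and GetHashCode() both use the three key fields, plus a MarkDuplicates() method that skips UNKNOWN and ERROR items as described in the question. The class name, the string field types, and the state values are assumptions made for illustration, not details from the original code. Like FindDuplicates, it flags only the second and later occurrences; the first one keeps its state.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical item type; Field1..Field3 and State follow the question's
// description, but their types (string) are an assumption.
public class Item
{
    public string Field1 { get; set; }
    public string Field2 { get; set; }
    public string Field3 { get; set; }
    public string State { get; set; }

    public override bool Equals(object obj)
    {
        var other = obj as Item;
        return other != null
            && Field1 == other.Field1
            && Field2 == other.Field2
            && Field3 == other.Field3;
    }

    // Must agree with Equals: equal items have to produce equal hash codes.
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + (Field1 == null ? 0 : Field1.GetHashCode());
            hash = hash * 31 + (Field2 == null ? 0 : Field2.GetHashCode());
            hash = hash * 31 + (Field3 == null ? 0 : Field3.GetHashCode());
            return hash;
        }
    }
}

public static class DuplicateMarker
{
    // Single pass over the list; each HashSet lookup is O(1) on average.
    public static void MarkDuplicates(List<Item> items)
    {
        var seen = new HashSet<Item>();
        foreach (var item in items)
        {
            // Assumed state values: these items are skipped, not tested.
            if (item.State == "UNKNOWN" || item.State == "ERROR")
                continue;

            // Add returns false when an equal item is already in the set.
            if (!seen.Add(item))
                item.State = "DUPLICATE";
        }
    }
}
```

Because State is not part of Equals() or GetHashCode(), changing it while the item sits in the set does not disturb the lookup.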

JaredPar answered Sep 17 '22 12:09


Your objects have some sort of state property. You're presumably finding duplicates based on another property or set of properties. Why not:

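// Keys seen so far. List<T>.Contains is a linear scan; a HashSet<object>
// would make each lookup O(1) on average.
List<object> keys = new List<object>();

foreach (MyObject obj in myList)
{
    if (keys.Contains(obj.keyProperty))
        obj.state = "something indicating a duplicate here";
    else
        keys.Add(obj.keyProperty);
}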
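
If every member of a duplicate group should be flagged, including the first occurrence, one possible variant is to group on a composite key built from the three fields. This is only a sketch: the Record class, its string fields, and the "DUPLICATE" value are hypothetical stand-ins for the types in the question. Anonymous types compare by their property values, so this works without relying on an Equals() override.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type standing in for the question's class.
public class Record
{
    public string Field1 { get; set; }
    public string Field2 { get; set; }
    public string Field3 { get; set; }
    public string State { get; set; }
}

public static class GroupMarker
{
    public static void MarkAllDuplicates(IEnumerable<Record> records)
    {
        // Group by the three key fields and keep only groups with more
        // than one member.
        var duplicateGroups = records
            .GroupBy(r => new { r.Field1, r.Field2, r.Field3 })
            .Where(g => g.Count() > 1);

        // The grouped elements are the same object references held by the
        // source collection, so this updates the original items.
        foreach (var group in duplicateGroups)
        {
            foreach (var record in group)
                record.State = "DUPLICATE";
        }
    }
}
```

GroupBy builds its lookup in one pass using hashing, so this stays close to linear time rather than comparing every pair of items.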

Chris answered Sep 17 '22 12:09