
Use Group By in order to remove duplicates

I am looking for a simple way of removing duplicates without having to implement IComparable, override GetHashCode, etc.

I think this can be achieved with LINQ. I have the class:

class Person
{
    public string Name;
    public int Age;
}

I have a list of about 500 people: List<Person> someList = new List<Person>();

Now I want to remove people with the same name, and if there is a duplicate I want to keep the person with the greater age. In other words, if I have the list:

Name----Age---
Tom,     24  |
Alicia,  22  |
Alicia,  12  |

I would like to end up with:

Name----Age---
Tom,     24  |
Alicia,  22  |

How can I do this with a query? My list is not that long, so I don't want to create a hash set or implement the IComparable interface. It would be nice if I could do this with a LINQ query.

I think this can be done with the GroupBy extension method by doing something like:

var people = // the list of Person
people.GroupBy(x=>x.Name).Where(x=>x.Count()>1)
      ...    // select the person that has the greatest age...
Tono Nam asked Jan 25 '26 04:01



2 Answers

people
  .GroupBy(p => p.Name)
  .Select(g => g.OrderByDescending(p => p.Age).First())

This will work across different LINQ providers. If this is just LINQ to Objects and speed is important (usually, it isn't), consider using one of the many MaxBy extensions found on the web (Jon Skeet's, for example) and replacing

g.OrderByDescending(p => p.Age).First()

with

g.MaxBy(p => p.Age)
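
To make the query above concrete, here is a minimal, self-contained sketch that runs it against the sample data from the question. The method name KeepOldestPerName and the Program class are hypothetical names chosen for illustration; they do not appear in the original post. Note that no Where(x => x.Count() > 1) filter is used, since that would drop names that occur only once (Tom, in the sample).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Person
{
    public string Name;
    public int Age;
}

static class Program
{
    // Hypothetical helper: keeps one Person per Name, the one with the greatest Age.
    public static List<Person> KeepOldestPerName(IEnumerable<Person> people)
    {
        return people
            .GroupBy(p => p.Name)                                  // one group per distinct name
            .Select(g => g.OrderByDescending(p => p.Age).First())  // oldest person in each group
            .ToList();
    }

    static void Main()
    {
        var people = new List<Person>
        {
            new Person { Name = "Tom", Age = 24 },
            new Person { Name = "Alicia", Age = 22 },
            new Person { Name = "Alicia", Age = 12 },
        };

        // Prints:
        // Tom, 24
        // Alicia, 22
        foreach (var p in KeepOldestPerName(people))
            Console.WriteLine($"{p.Name}, {p.Age}");
    }
}
```

In LINQ to Objects, GroupBy yields groups in the order their keys first appear, so the result keeps the original ordering of names.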
spender answered Jan 26 '26 16:01



This is trivially easy once you have a helper function MaxBy that selects the item from a sequence whose key, as computed by a selector, is largest. Unfortunately, the Max function in LINQ won't work here: it returns the largest selected value, whereas we want the item that produced it.

var distinctPeople = people.GroupBy(person => person.Name)
   .Select(group => group.MaxBy(person => person.Age));

And then the implementation of MaxBy:

public static TSource MaxBy<TSource, TKey>(this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector, IComparer<TKey> comparer = null)
{
    comparer = comparer ?? Comparer<TKey>.Default;

    using (var iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
            throw new ArgumentException("Source must have at least one item");

        var maxItem = iterator.Current;
        var maxKey = keySelector(maxItem);

        while (iterator.MoveNext())
        {
            var nextKey = keySelector(iterator.Current);
            if (comparer.Compare(nextKey, maxKey) > 0)
            {
                maxItem = iterator.Current;
                maxKey = nextKey;
            }
        }

        return maxItem;
    }
}

Note that while you can achieve the same result by sorting the sequence and then taking the first item, doing so is less efficient in general than doing just one pass with a max function.
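
If you would rather not add a helper at all, a single pass per group can also be sketched with the built-in Enumerable.Aggregate. This is a different technique from the MaxBy helper above, shown only as an alternative; the class name AggregateSketch and method name KeepOldestPerName are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Person
{
    public string Name;
    public int Age;
}

static class AggregateSketch
{
    // Hypothetical helper: one pass over each group, no sorting and no custom extension.
    public static List<Person> KeepOldestPerName(IEnumerable<Person> people)
    {
        return people
            .GroupBy(p => p.Name)
            // Groups produced by GroupBy are never empty, so the seedless
            // Aggregate overload cannot throw on an empty sequence here.
            // On equal ages, the first person encountered is kept.
            .Select(g => g.Aggregate((best, next) => next.Age > best.Age ? next : best))
            .ToList();
    }
}
```

Like the MaxBy helper, this compares with a strict "greater than", so ties resolve to the earliest item in the group.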

Servy answered Jan 26 '26 18:01



