Most efficient way to remove duplicates from a List

Tags: c#, list, distinct

Let's say I have a List with duplicate values and I want to remove the duplicates.

List<int> myList = new List<int>(Enumerable.Range(0, 10000));

// adding a few duplicates here
myList.Add(1); 
myList.Add(2);
myList.Add(3);

I have found 3 approaches to solve this:

List<int> result1 = new HashSet<int>(myList).ToList(); //3700 ticks
List<int> result2 = myList.Distinct().ToList(); //4700 ticks
List<int> result3 = myList.GroupBy(x => x).Select(grp => grp.First()).ToList(); //18800 ticks
//referring to pinturic's comment:
List<int> result4 = new SortedSet<int>(myList).ToList(); //18000 ticks

In most answers here on SO, the Distinct approach is shown as the "correct one", yet the HashSet is always faster!

My question: is there anything I have to be aware of when I use the HashSet approach and is there another more efficient way?

asked May 21 '15 07:05

fubo

1 Answer

There is a big difference between these two approaches:

List<int> Result1 = new HashSet<int>(myList).ToList(); //3700 ticks
List<int> Result2 = myList.Distinct().ToList(); //4700 ticks

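The first one does not guarantee the order of the elements in the returned List<>: HashSet<> is an unordered collection, so the elements of Result1 may not be in the same order as those of myList. The second one maintains the original ordering, because each element is yielded the first time it is seen.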
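To make the ordering point concrete, here is a minimal sketch. The class and method names (OrderCheck, ViaDistinct, ViaHashSetSorted) are made up for illustration. It assumes that when a specific order is needed after the HashSet approach, you impose it explicitly (here by sorting) rather than relying on whatever order the set happens to enumerate in.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class, for illustration only.
public static class OrderCheck
{
    // Distinct() yields each value the first time it is encountered,
    // so the result follows first-occurrence order.
    public static List<int> ViaDistinct(IEnumerable<int> source)
    {
        return source.Distinct().ToList();
    }

    // HashSet<T> makes no ordering promise, so sort explicitly
    // if a particular order is required afterwards.
    public static List<int> ViaHashSetSorted(IEnumerable<int> source)
    {
        var result = new HashSet<int>(source).ToList();
        result.Sort();
        return result;
    }

    public static void Main()
    {
        var input = new List<int> { 3, 1, 3, 2, 1 };
        Console.WriteLine(string.Join(",", ViaDistinct(input)));      // 3,1,2
        Console.WriteLine(string.Join(",", ViaHashSetSorted(input))); // 1,2,3
    }
}
```

If the order of the result does not matter to the caller, the plain HashSet version from the question is enough and the sort can be dropped.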

There is probably no faster way than the first one.
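Tick counts like the ones quoted depend heavily on the machine, the runtime and JIT warm-up, so it is worth re-measuring locally. The following is only a rough sketch using Stopwatch; the helper names (DedupTiming, BuildInput, AverageTicks) and the run count of 1000 are assumptions, and a dedicated benchmarking library would give more trustworthy numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Hypothetical timing harness, for illustration only.
public static class DedupTiming
{
    // Same input as in the question: 0..9999 plus three duplicates.
    public static List<int> BuildInput()
    {
        var list = new List<int>(Enumerable.Range(0, 10000));
        list.Add(1);
        list.Add(2);
        list.Add(3);
        return list;
    }

    // Runs the action once as a warm-up (JIT compilation), then
    // returns the average Stopwatch ticks over the given number of runs.
    public static long AverageTicks(Action action, int runs)
    {
        action();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            action();
        }
        sw.Stop();
        return sw.ElapsedTicks / runs;
    }

    public static void Main()
    {
        var myList = BuildInput();
        const int runs = 1000; // arbitrary choice for this sketch

        Console.WriteLine("HashSet:   " + AverageTicks(() => new HashSet<int>(myList).ToList(), runs));
        Console.WriteLine("Distinct:  " + AverageTicks(() => myList.Distinct().ToList(), runs));
        Console.WriteLine("GroupBy:   " + AverageTicks(() => myList.GroupBy(x => x).Select(g => g.First()).ToList(), runs));
        Console.WriteLine("SortedSet: " + AverageTicks(() => new SortedSet<int>(myList).ToList(), runs));
    }
}
```

The absolute numbers will differ from those in the question; only the relative ranking on your own setup is meaningful.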

There is probably no "more correct" way (for a definition of "correct" based on ordering) than the second one.

(The third approach, using GroupBy, behaves like the second one, only slower.)

Just out of curiosity, Distinct() is implemented like this:

// Reference source http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,712
public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> source) {
    if (source == null) throw Error.ArgumentNull("source");
    return DistinctIterator<TSource>(source, null);
}

// Reference source http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,722
static IEnumerable<TSource> DistinctIterator<TSource>(IEnumerable<TSource> source, IEqualityComparer<TSource> comparer) {
    Set<TSource> set = new Set<TSource>(comparer);
    foreach (TSource element in source)
        if (set.Add(element)) yield return element;
}

So in the end, Distinct() simply uses an internal implementation of a HashSet<> (called Set<>) to check the uniqueness of items.
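The same idea can be written by hand with the public HashSet<>, using the boolean returned by Add to decide whether an item is new. This is only a sketch of the technique, not the framework's code; the names OrderedDedup and RemoveDuplicates are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, for illustration only.
public static class OrderedDedup
{
    // Returns the items of source without duplicates, in first-occurrence order.
    // A null comparer makes HashSet<T> fall back to the default equality comparer.
    public static List<T> RemoveDuplicates<T>(IEnumerable<T> source, IEqualityComparer<T> comparer = null)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));

        var seen = new HashSet<T>(comparer);
        var result = new List<T>();
        foreach (var item in source)
        {
            // Add returns false when the item is already in the set.
            if (seen.Add(item))
            {
                result.Add(item);
            }
        }
        return result;
    }

    public static void Main()
    {
        var numbers = new List<int> { 3, 1, 3, 2, 1 };
        Console.WriteLine(string.Join(",", RemoveDuplicates(numbers))); // 3,1,2

        var words = new List<string> { "a", "B", "A", "b", "c" };
        Console.WriteLine(string.Join(",", RemoveDuplicates(words, StringComparer.OrdinalIgnoreCase))); // a,B,c
    }
}
```

Unlike Distinct(), this version is eager: it builds the whole list immediately instead of yielding items lazily.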

For completeness' sake, I'll add a link to the question "Does C# Distinct() method keep original ordering of sequence intact?"

answered Sep 23 '22 16:09

xanatos
