What's better for creating distinct data structures: HashSet or Linq's Distinct()?

Tags: performance, linq, distinct, hashset

I'm wondering whether I can get a consensus on which is the better approach to creating a distinct set of elements: a C# HashSet or IEnumerable's .Distinct(), which is a LINQ method?

Let's say I'm looping through query results from the DB with a DataReader, and my options are to add the objects I construct to a List<SomeObject> or to a HashSet<SomeObject>. With the List option, I would wind up having to do something like:

myList = myList.Distinct().ToList<SomeObject>();

With the HashSet, my understanding is that adding elements to it takes care of the non-duplication by itself, assuming you've overridden the GetHashCode() and Equals() methods in SomeObject. I'm concerned mainly with the risks and performance aspects of the options.

Thanks.

asked Jun 09 '11 20:06

MegaMatt

2 Answers

Anthony Pegram has said it best: use the right tool for the job. I say this because there isn't a big difference between Distinct and HashSet when it comes to performance. Use a HashSet when the collection should always hold only distinct items. It also tells the programmer that you can't add duplicates to it. Use a normal List<T> and call .Distinct() on it when you will have to add duplicates and remove them later. The intention matters.

In general,

a) A HashSet may not do any good if you're adding new objects from the db and you haven't specified a custom Equals of your own. Every object from the db can be a new instance for your hashset (if you are just new-ing them up), and that will lead to duplicates in the collection. In that case use a normal List<T>. (Distinct() relies on the same default equality, so it will not remove those duplicates either.)

b) If you do have an equality comparer defined for the hashset, and your collection should always hold only distinct objects, use a hashset.

c) If you do have an equality comparer defined for the hashset, and you want only distinct objects from the db but the collection need not always hold only distinct objects (i.e. duplicates may need to be added later), a faster approach is to read the items from the db into a hashset and then return a regular list from that hashset.

d) Best of all, hand the task of removing duplicates to the database. That's the right tool, and it's first class at it!
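To make cases (a) to (c) concrete, here is a minimal sketch. The types PlainRow and KeyedRow and the EqualityDemo class are hypothetical names invented for illustration (they are not from the question), and the int array merely stands in for rows read from a database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type WITHOUT equality overrides: compared by reference.
public sealed class PlainRow
{
    public int Id { get; }
    public PlainRow(int id) { Id = id; }
}

// Hypothetical row type WITH value equality based on Id.
public sealed class KeyedRow : IEquatable<KeyedRow>
{
    public int Id { get; }
    public KeyedRow(int id) { Id = id; }

    public bool Equals(KeyedRow other) => other != null && Id == other.Id;
    public override bool Equals(object obj) => Equals(obj as KeyedRow);
    public override int GetHashCode() => Id.GetHashCode();
}

public static class EqualityDemo
{
    // Stand-in for ids coming back from a query, with repeats.
    private static readonly int[] Ids = { 1, 2, 2, 3, 1 };

    // Case (a): each new instance is unique by reference, so the set keeps all of them.
    public static int CountPlain()
    {
        var set = new HashSet<PlainRow>();
        foreach (var id in Ids)
            set.Add(new PlainRow(id));
        return set.Count;
    }

    // Cases (b) and (c): the set drops repeats while loading,
    // and ToList() hands back a regular list afterwards.
    public static List<KeyedRow> DistinctKeyed()
    {
        var set = new HashSet<KeyedRow>();
        foreach (var id in Ids)
            set.Add(new KeyedRow(id));
        return set.ToList();
    }
}
```

With the five ids above, CountPlain() returns 5 (nothing is removed), while DistinctKeyed() returns three rows, one per distinct Id.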

As for performance differences, in my testing I always found HashSet to be faster, but only marginally. That's to be expected, considering that with the List approach you first have to add every item and then run a distinct pass over the list.

Test method: Starting with two general functions,

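// Requires: using System; using System.Collections.Generic;
//           using System.Diagnostics; using System.Linq;
// Both methods must be declared inside a static class, because Repeat is an extension method.

// Runs `method` the given number of times and reports the total elapsed time.
public static void Benchmark(Action method, int iterations = 10000)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < iterations; i++)
        method();

    sw.Stop();
    // MsgBox is not a framework type (presumably a custom helper);
    // Console.WriteLine can be substituted in a console app.
    MsgBox.ShowDialog(sw.Elapsed.TotalMilliseconds.ToString());
}

// Returns a new list containing the items of `lst` repeated `count` times.
public static List<T> Repeat<T>(this ICollection<T> lst, int count)
{
    if (count < 0)
        throw new ArgumentOutOfRangeException("count");

    var ret = Enumerable.Empty<T>();

    for (var i = 0; i < count; i++)
        ret = ret.Concat(lst);

    return ret.ToList();
}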

Implementation:

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }
});

~3300 ms

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list.Clear();
    foreach (var item in d)
    {
        list.Add(item);
    }

    list = list.Distinct().ToList();
});

~5800 ms

A difference of 2.5 seconds is not bad for a list of 10000 objects iterated another 10000 times. In normal cases the difference will hardly be noticeable.

Possibly the best approach for you, given your current design:

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }

    list = hash.ToList();
});

~3300 ms

As you can see, there isn't any significant difference.

Partly unrelated: after posting this answer, I was curious to know the best approach to removing duplicates from a normal list.

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash = new HashSet<int>(d);
});

~3900 ms

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list = d.Distinct().ToList();
});

~3200 ms

Here the right tool, Distinct, is faster than the hackish HashSet! Perhaps it's the overhead of creating a hash set.

I have tested various other combinations, such as reference types and original lists without duplicates. The results are consistent.

answered Oct 13 '22 11:10

nawfal

What's better is whichever most clearly expresses your intention. The internal implementation details are more or less going to be the same, the difference being "who's writing the code?"
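To illustrate why the internals end up so similar, here is a simplified sketch of how a Distinct-style operator can be written on top of a HashSet<T>. DistinctSketch and DistinctLike are hypothetical names, and this is not the framework's actual source, only the general idea of tracking already-seen items in a set.

```csharp
using System.Collections.Generic;

public static class DistinctSketch
{
    // Lazily yields each item the first time it is seen, using default equality.
    public static IEnumerable<T> DistinctLike<T>(IEnumerable<T> source)
    {
        var seen = new HashSet<T>();
        foreach (var item in source)
        {
            // HashSet<T>.Add returns false when the item is already present.
            if (seen.Add(item))
                yield return item;
        }
    }
}
```

Either way a hash-based set does the work; the choice is mostly about whether you own that set yourself or let the operator manage it for you.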

If your intention is to create from the ground up a distinct collection of items from a source that is not a collection of said items, I would argue for the HashSet<T>. You have to create the item, you have to build the collection, you might as well build the right one from the beginning.

Otherwise, if you already have a collection of items and you want to eliminate duplicates, I would argue for invoking Distinct(). You already have a collection, you just want an expressive way to get the distinct items out of it.
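The two situations above might look like this in code. This is only a sketch: IntentDemo and ReadNames are hypothetical, with ReadNames standing in for whatever loop constructs items from a DataReader.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class IntentDemo
{
    // Hypothetical stand-in for items constructed one by one from a data source.
    public static IEnumerable<string> ReadNames()
    {
        yield return "ann";
        yield return "bob";
        yield return "ann";
    }

    // Building from the ground up: the set enforces uniqueness as items arrive.
    public static HashSet<string> BuildDistinctFromSource()
    {
        var names = new HashSet<string>();
        foreach (var name in ReadNames())
            names.Add(name);
        return names;
    }

    // Already holding a collection: ask for its distinct items.
    public static List<string> DistinctFromExisting(List<string> existing)
    {
        return existing.Distinct().ToList();
    }
}
```

In the first method the return type itself documents that the result cannot contain duplicates; in the second, the call to Distinct() documents that duplicates were expected and are being removed.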

answered Oct 13 '22 10:10

Anthony Pegram
