I have a HashSet<int>
and a List<int>
(Hashset has approximately 3 Million items, List has approximately 300k items).
I currently intersect them using
var intersected = hashset.Intersect(list).ToArray();
and I wonder if there is any faster way to do so. Maybe in parallel?
HashSet.IntersectWith (IEnumerable) Method is used to modify the current HashSet object to contain only elements that are present in that object and in the specified collection.
private static IEnumerable<int> Intersect (HashSet<int> hash, List<int> list) { HashSet<int> intersect = new HashSet<int> (list); intersect.IntersectWith (hash); return intersect; }
Convert List to HashSet: 1. Using HashSet Constructor We can make use of HashSet constructorwhich takes an IEnumerable<T>to construct a new instance of the HashSet<T>class containing distinct elements of the list. Note that since HashSetcontains distinct elements, any repeated elements in the list will be discarded.
HashSet (int initialCapacity): This constructor is used to build an empty HashSet object in which the initialCapacity is specified at the time of object creation. Here, the default loadFactor remains 0.75.
HashSet
has a method IntersectWith
that is optimized if intersection is performed between two hash sets. Using method IntersectWith
we can intersect HashSet
and List
using the next approach:
private static IEnumerable<int> Intersect(HashSet<int> hash, List<int> list)
{
HashSet<int> intersect = new HashSet<int>(list);
intersect.IntersectWith(hash);
return intersect;
}
I have measured (using Stopwatch
) performance of your original method (Linq Intersect
), methods proposed by @TheodorZoulias (HashSet Contains
and HashSet Contains Parallel
) and my method (HashSet IntersectWith
). Here are results:
------------------------------------------------------------------------
| Method | Min, ms | Max, ms | Avg, ms | StdDev, ms |
------------------------------------------------------------------------
| Linq Intersect | 135 | 274 | 150 | 17 |
| HashSet Contains | 25 | 44 | 26 | 2 |
| HashSet Contains Parallel | 12 | 53 | 13 | 3 |
| HashSet IntersectWith | 57 | 89 | 61 | 4 |
------------------------------------------------------------------------
From the table we can see that the fastest method is HashSet Contains Parallel
and the slowest is Linq Intersect
.
Here is complete source code that was used to measure performance.
Yes, you can go faster because you have already a HashSet
in hand. The LINQ Intersect
uses a generic algorithm, that essentially recreates a HashSet
from scratch every time it's called. Here is a faster algorithm:
/// <summary>Yields all the elements of first (including duplicates) that also
/// appear in second, in the order in which they appear in first.</summary>
public static IEnumerable<TSource> Intersect<TSource>(IEnumerable<TSource> first,
HashSet<TSource> second)
{
foreach (TSource element in first)
{
if (second.Contains(element)) yield return element;
}
}
Update: Here is a parallel version of the above idea:
var intersected = list.AsParallel().Where(x => hashset.Contains(x)).ToArray();
I wouldn't expect it to be much faster, if at all, because the workload is too granular. The overhead of calling a lambda 300,000 times will probably overshadow any benefits of the parallelism.
Also the order of the results will not be preserved, unless the AsOrdered
PLINQ method is added in the query, hurting further the performance of the operation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With