Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How is HashSet<T>.Contains faster than List<T>.Contains?

I have a simple requirement: I have millions of strings, and want to test if they exist in a small set. I'm in doubt between using a List<T> vs a HashSet<T> for this set.

When the requirement is the opposite, for example, you have 100 strings and need to check if they exist in a set of millions of strings, I fully understand that the HashSet<T> is the best choice.

But in my case, it seems that .NET has to calculate millions of hashes (calls to GetHashCode) when calling Contains on the HashSet<T>, so calling Contains of a List<T> could be faster?

Can anyone explain if this assumption is correct?

like image 594
Maestro Avatar asked Dec 10 '22 05:12

Maestro


1 Answers

Neither of these seems appropriate to me - a HashSet<string> sounds like it may be the best approach to me.

Yes, .NET has to compute the hash code for each string - the question is whether or not that takes as long as checking for equality with each of the hundreds of strings in the candidate set.

As per all performance questions, you should really test this rather than guessing. For example, if all the strings have different lengths and they're all long, then Equals will be cheap against each candidate, and GetHashCode could take a long time. However, if all your strings are length 10 starting with the same 6 characters, say, then GetHashCode will be reasonably cheap but each string equality check will have to check all of those common prefix characters. Which of these is more like your actual situation? What have your benchmarks shown? How fast do you need this to be?

like image 52
Jon Skeet Avatar answered Dec 31 '22 11:12

Jon Skeet