I have a simple requirement: I have millions of strings, and want to test if they exist in a small set. I'm in doubt between using a List<T>
vs a HashSet<T>
for this set.
When the requirement is the opposite, for example, you have 100 strings and need to check if they exist in a set of millions of strings, I fully understand that the HashSet<T>
is the best choice.
But in my case, it seems that .NET has to calculate millions of hashes (calls to GetHashCode
) when calling Contains
on the HashSet<T>
, so calling Contains
of a List<T>
could be faster?
Can anyone explain if this assumption is correct?
Neither of these seems appropriate to me - a HashSet<string>
sounds like it may be the best approach to me.
Yes, .NET has to compute the hash code for each string - the question is whether or not that takes as long as checking for equality with each of the hundreds of strings in the candidate set.
As per all performance questions, you should really test this rather than guessing. For example, if all the strings have different lengths and they're all long, then Equals
will be cheap against each candidate, and GetHashCode
could take a long time. However, if all your strings are length 10 starting with the same 6 characters, say, then GetHashCode
will be reasonably cheap but each string equality check will have to check all of those common prefix characters. Which of these is more like your actual situation? What have your benchmarks shown? How fast do you need this to be?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With