Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is HashSet<T> the fastest container to look up in?

I need to check that specific string contains in the set of others:

private bool Contains(string field)
{
   return this.Fields.Contains(field); // HashSet<string> local property
}

What is the best type of container to use if only one task of it - to hold a number of strings and check does another one is into or does not?

like image 446
abatishchev Avatar asked Feb 13 '10 20:02

abatishchev


People also ask

Is HashSet fast?

The main difference between HashSet and List is that the list can contain duplicate elements, though both are used to store unique values. HashSet is much faster than List collection because HashSet lazily initializes underlying data structures.

Why is HashSet faster?

Simply put, HashSet is faster than the TreeSet.HashSet provides constant-time performance for most operations like add(), remove() and contains(), versus the log(n) time offered by the TreeSet. Usually, we can see that the execution time for adding elements into TreeSet is much more than for the HashSet.

When should I use HashSet?

A HashSet is usually used for high-performance operations involving a set of unique data. Since HashSet contains only unique elements, its internal structure is optimized for faster searches. Note that you can store a single null value in a HashSet.

What is Hash set in C#?

In C#, HashSet is an unordered collection of unique elements. This collection is introduced in . NET 3.5. It supports the implementation of sets and uses the hash table for storage. This collection is of the generic type collection and it is defined under System.


2 Answers

Yes, HashSet is perfect for this since it contains one value to look up unlike a Dictionary which requires a key and a value.

like image 127
Ahmad Mageed Avatar answered Sep 23 '22 01:09

Ahmad Mageed


Does HashSet work? Sure. But that's not the question you asked. You asked for the fastest possible lookup.

Is it the fastest possible? No, of course not, not by any measure.

First off, in order to talk about "fastest" we need to describe precisely what "fastest" means. Do you mean:

  • smallest worst possible case timing
  • smallest average timing averaged over many timings
  • smallest average timing given a particular usage pattern
  • something else

? Please clarify precisely what "fastest possible" means. We can devise you an algorithm which is the in theory fastest possible only if we know precisely what fastest possible means to you.

For example, suppose you are writing a compiler. Something we have to do all the time in compilers is check whether a particular string is in a list of strings. Perhaps we are checking to see if a string is a keyword, so we have to look up whether a given string is inside the set {"int", "double", "for", "foreach", "class" ... }

We could put those in a hash set and get decent performance. But if we wanted the best possible performance we could do a lot better. We could, for example, do an analysis of a few billion lines of existing source code to find out which keywords were most common and which were least common, and then write a custom hash table that optimized for (1) rapidly rejecting things that were not keywords at all, and (2) rapidly recognizing the most common keywords at the expense of recognizing other keywords.

Note that this requires static analysis; though it performs well on typical cases, it performs poorly on those rare cases where there are lots of rare keywords used. Another approach we could take would be to write a self tuning hash table that dynamically identified when particular strings were being searched for frequently.

Consider, for example, if you are writing an implementation of the JScript runtime. We frequently must look for a string in a set of strings:

for(i = 0; i < 10; ++i) { foo.bar(i); }

Here we must look up the string "bar" inside the object identified by "foo" ten times. The hash table inside "foo" that implements that lookup notices the first time through the loop that "bar" has been used, so it dynamically tweaks the hash table structure so that the second time through the loop, the lookup is faster. This is the strategy we employed in our implementation of JScript.

Now, that optimizes the case for loops, but it makes this case potentially slower than it could be:

for(i = 0; i < 10; ++i) { foo.bar(i); foo.blah(i); foo.abc(i); }

because we don't do more analysis and realize "hey, we just re-optimized this hash table three times, and now we're going to do it all again, maybe we should just leave it as it is."

Fortunately for us, we were not, like you, looking for the fastest possible lookup. We were only looking for a reasonably fast lookup.

Can you carefully and completely describe what exactly your usage case is for the fastest possible lookup? There are lots of algorithms you can use to speed up lookups, but they get very complicated.

like image 36
Eric Lippert Avatar answered Sep 23 '22 01:09

Eric Lippert