 

How can HashSet.Contains be O(1) with this implementation?

The HashSet.Contains implementation in .NET is:

    /// <summary>
    /// Checks if this hashset contains the item
    /// </summary>
    /// <param name="item">item to check for containment</param>
    /// <returns>true if item contained; false if not</returns>
    public bool Contains(T item) {
        if (m_buckets != null) {
            int hashCode = InternalGetHashCode(item);
            // see note at "HashSet" level describing why "- 1" appears in for loop
            for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {
                if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, item)) {
                    return true;
                }
            }
        }
        // either m_buckets is null or wasn't found
        return false;
    }

I've read in a lot of places that "search complexity in a HashSet is O(1)". How can that be? And why does that for loop exist then?

Edit: .NET reference source link: https://github.com/microsoft/referencesource/blob/master/System.Core/System/Collections/Generic/HashSet.cs

asked Jan 09 '20 by DrkDeveloper

People also ask

Does HashSet.contains() run in O(1)?

On average, the contains() of HashSet runs in O(1) time. Getting the object's bucket location is a constant-time operation. Taking possible collisions into account, the lookup time may rise to O(log n), because since Java 8 an over-full bucket is restructured from a linked list into a balanced tree.

How does the HashSet contains() method work?

The purpose of the contains() method is to check whether an element is present in a given HashSet. It returns true if the element is found, false otherwise. Whenever an object is passed to this method, its hash value is calculated; then the corresponding bucket location is resolved and traversed.

What is the complexity of HashSet in Java?

It runs in O(1) expected time, as does any hash table (assuming the hash function is decent). It is backed by a HashMap whose keys are the set's elements.

What is the time complexity of the contains() operation on lists?

indexOf() iterates through the internal array and checks each element one by one, so it always takes O(n) time. contains() is implemented in terms of indexOf(), so it also runs in O(n) time.
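For a feel of the gap in practice, here's a rough C# comparison (my own demo; the timings are machine-dependent and this is illustrative, not a rigorous benchmark):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;

    class ListVsSet
    {
        static void Main()
        {
            var list = Enumerable.Range(0, 100_000).ToList();
            var set = new HashSet<int>(list);

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 1_000; i++)
                list.Contains(-1);              // miss: scans all 100,000 elements, O(n)
            Console.WriteLine($"List<int>.Contains:    {sw.ElapsedMilliseconds} ms");

            sw.Restart();
            for (int i = 0; i < 1_000; i++)
                set.Contains(-1);               // expected O(1): one bucket probe
            Console.WriteLine($"HashSet<int>.Contains: {sw.ElapsedMilliseconds} ms");
        }
    }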


1 Answer

The classic implementation of a hash table works by assigning elements to one of a number of buckets, based on the hash of the element. If the hashing were perfect, i.e. no two elements had the same hash, we'd be living in a perfectly perfect world where we wouldn't need to care about anything - every lookup would be O(1), because we'd only need to compute the hash, get the bucket, and check whether the element is inside.

We're not living in a perfectly perfect world. First off, consider string hashing. In .NET, there are (2^16)^n possible strings of length n. GetHashCode returns an int, and there are 2^32 possible values of int. That's exactly enough to hash every string of length 2 to a unique int, but as soon as we allow longer strings, there must exist two different values that give the same hash - this is called a collision. Also, we don't want to maintain 2^32 buckets at all times anyway. The usual way of dealing with that is to take the hash code and compute its value modulo the number of buckets to determine the bucket's number [1]. So, the takeaway is: we need to allow for collisions.
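To make the modulo step concrete, here's a small sketch of my own (not the BCL code; the bucket count of 8 is arbitrary) that maps a few strings to buckets the same way:

    using System;

    class BucketDemo
    {
        static void Main()
        {
            const int bucketCount = 8; // deliberately tiny, so collisions are easy to hit

            foreach (var s in new[] { "apple", "grape", "melon", "peach" })
            {
                // Clear the sign bit, as the reference source's InternalGetHashCode does,
                // so the modulo result is never negative.
                int hash = s.GetHashCode() & 0x7FFFFFFF;
                int bucket = hash % bucketCount;
                Console.WriteLine($"{s,-6} hash={hash,10} bucket={bucket}");
            }
        }
    }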

The referenced .NET Framework implementation uses the simplest way of dealing with collisions - every bucket holds a linked list of all objects that map to that bucket. You add object A, and it's assigned to bucket i. You add object B, it maps to the same bucket, so it's chained into bucket i together with A. Now if you look up any element, you need to traverse the list of objects in its bucket and call the actual Equals method to find out whether that thing is actually the one you're looking for. That explains the for loop - in the worst case you have to go through the entire list.
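Stripped to the bone, the chaining scheme looks roughly like this toy set of my own (the real implementation stores entries in a flat slots array, checks for duplicates on Add, and resizes; this sketch does none of that and assumes non-null items):

    using System.Collections.Generic;

    class ToyHashSet<T>
    {
        private sealed class Node
        {
            public T Value;
            public int HashCode;
            public Node Next;
        }

        private Node[] _buckets = new Node[16];
        private readonly IEqualityComparer<T> _comparer = EqualityComparer<T>.Default;

        public void Add(T item)
        {
            int hash = _comparer.GetHashCode(item) & 0x7FFFFFFF;
            int bucket = hash % _buckets.Length;
            // Prepend the new entry to the chain for its bucket.
            _buckets[bucket] = new Node { Value = item, HashCode = hash, Next = _buckets[bucket] };
        }

        public bool Contains(T item)
        {
            int hash = _comparer.GetHashCode(item) & 0x7FFFFFFF;
            // Walk the chain for this bucket - the analogue of the for loop in
            // the BCL source, with object references instead of slot indices.
            for (Node n = _buckets[hash % _buckets.Length]; n != null; n = n.Next)
            {
                if (n.HashCode == hash && _comparer.Equals(n.Value, item))
                    return true;
            }
            return false;
        }
    }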

Okay, so how can "search complexity in hashset is O(1)" be true? It's not [2]. The worst-case complexity is proportional to the number of items - it's O(1) only on average. If all objects fall into the same bucket, asking for an element at the end of the list (or for one that is not in the structure but would hash to that bucket) will be O(n).
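You can force the degenerate case yourself. In the demo below (the ConstantHashKey type is made up for illustration), GetHashCode returns a constant, so every element lands in one bucket and lookups - and even insertions - degrade to linear scans:

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    // A key type that deliberately defeats hashing: every instance collides.
    class ConstantHashKey
    {
        public int Id;
        public override int GetHashCode() => 42;
        public override bool Equals(object obj) => obj is ConstantHashKey k && k.Id == Id;
    }

    class WorstCaseDemo
    {
        static void Main()
        {
            var set = new HashSet<ConstantHashKey>();
            // Even building the set is quadratic here: each Add must walk the
            // single ever-growing chain to check for duplicates.
            for (int i = 0; i < 10_000; i++)
                set.Add(new ConstantHashKey { Id = i });

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 1_000; i++)
                set.Contains(new ConstantHashKey { Id = -1 }); // miss: walks all 10,000 entries
            Console.WriteLine($"1,000 missed lookups took {sw.ElapsedMilliseconds} ms");
        }
    }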

So what do people mean by "it's O(1) on average"? The structure monitors the ratio of stored objects to buckets - the load factor - and when it exceeds some threshold, it resizes, rehashing everything into a larger bucket array. Assuming the hash spreads elements evenly, the expected length of each chain equals the load factor, so the average lookup time is proportional to the load factor - a constant.
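Sketched against the toy set from above (the doubling strategy and the threshold of 1.0 are my assumptions; the real implementation grows the bucket array to prime sizes), a resize step could look like this:

    // To be added inside the ToyHashSet<T> sketch above; Add should increment
    // _count and then call ResizeIfNeeded().
    private int _count;

    private void ResizeIfNeeded()
    {
        if ((double)_count / _buckets.Length <= 1.0) // assumed load-factor threshold
            return;

        Node[] old = _buckets;
        _buckets = new Node[old.Length * 2];
        // Re-distribute every entry: a bucket index depends on the bucket
        // count, so it must be recomputed against the new, larger array.
        foreach (Node head in old)
        {
            for (Node n = head; n != null; )
            {
                Node next = n.Next;
                int bucket = n.HashCode % _buckets.Length;
                n.Next = _buckets[bucket];
                _buckets[bucket] = n;
                n = next;
            }
        }
    }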

That's why it's important for hash functions to be uniform, meaning that the probability that two randomly chosen different objects get the same int assigned is 1/2^32 [3]. That keeps the distribution of objects in a hash table uniform, so we avoid pathological cases where one bucket contains a huge number of items.

Note that if you know the hash function and the algorithm used by the hash table, you can force such a pathological case and O(n) lookups. If a server takes inputs from a user and stores them in a hash table, an attacker who knows the hash function and the hash table implementation could use this as a vector for a denial-of-service attack (so-called hash flooding). There are ways of dealing with that too, such as randomizing the hash function per process. Treat this as a demonstration that yes, the worst case can be O(n), and that people are generally aware of that.

There are dozens of other, more complicated ways hash tables can be implemented; if you're interested, you'll have to research them on your own. Since lookup structures are so commonplace in computer science, people have come up with all sorts of crazy optimisations that minimise not only the theoretical number of operations, but also real-world costs such as CPU cache misses.


[1] That's exactly what's happening in the statement int i = m_buckets[hashCode % m_buckets.Length] - 1. The - 1 is there because m_buckets stores (slot index + 1), so that the array's default value of 0 can mean "empty bucket".

[2] At least not for hash tables using naive chaining. There exist hash tables with worst-case constant lookup time (cuckoo hashing is one example), but they're usually worse in practice than the theoretically (with regard to time complexity) slower implementations, mainly due to CPU cache misses.

[3] I'm assuming the domain of possible hashes is the set of all ints, so there are 2^32 of them, but everything I wrote generalises to any other non-empty, finite set of values.

answered Sep 21 '22 by V0ldek