I ran across an assertion that HashSet<T>.Contains() is an O(1) operation. This surprised me since every discussion of hashing I've encountered mentions the possibility of collisions, potentially leading to O(n) run time.
Being curious, I looked at the documentation for HashSet<T>.Contains and also Hashtable.Contains. The documentation for both methods makes that same claim.
When I look in Reflector, HashSet<T>.Contains() is implemented with a for loop, going through a list of slots containing values that have the same hash.
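Roughly, the shape of what I see is something like this (my own simplified reconstruction for illustration, not the actual BCL source; all the names here are made up):

```csharp
using System.Collections.Generic;

// Simplified sketch of a chained hash-set lookup; NOT the real HashSet<T> code.
class SimpleHashSet<T>
{
    private struct Slot { public int HashCode; public T Value; public int Next; }

    private int[] buckets = new int[7]; // bucket -> (slot index + 1); 0 means empty
    private Slot[] slots = new Slot[7];
    private readonly IEqualityComparer<T> comparer = EqualityComparer<T>.Default;

    public bool Contains(T item)
    {
        int hashCode = comparer.GetHashCode(item) & 0x7FFFFFFF;
        // Walk only the slots whose hash fell into this bucket. With a good
        // hash function each chain stays short, but if everything collides
        // this loop degrades to scanning all n items.
        for (int i = buckets[hashCode % buckets.Length] - 1; i >= 0; i = slots[i].Next)
        {
            if (slots[i].HashCode == hashCode && comparer.Equals(slots[i].Value, item))
                return true;
        }
        return false;
    }
}
```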
Now admittedly, those same discussions of hashing have also mentioned that a good hashing algorithm avoids collisions, and under those circumstances the lookup will indeed be O(1). But my understanding of Big O notation is that it describes the worst-case running time, not the best.
So is the O(1) claim incorrect? Or am I missing something?
But my understanding of Big O notation is that it describes the worst-case running time, not the best.
Unfortunately, there is no single convention for which case Big-O describes when analyzing algorithms. It's often used to describe the typical or average case, not the worst case.
From Wikipedia:
...this notation is now frequently also used in the analysis of algorithms to describe an algorithm's usage of computational resources: the worst case or average case...
In this case, it's describing the standard case, assuming proper hashing. If you have proper hashing in place, the limiting behavior is constant as the size N grows, hence O(1).
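As a quick (and unscientific) sanity check, you can time lookups against sets of very different sizes and watch the per-lookup cost stay roughly flat; the set sizes and lookup count below are arbitrary choices of mine:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        const int lookups = 1_000_000;
        foreach (int n in new[] { 1_000, 100_000, 10_000_000 })
        {
            var set = new HashSet<int>(Enumerable.Range(0, n));
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < lookups; i++)
                set.Contains(i % n); // always a hit; cost should not grow with n
            sw.Stop();
            Console.WriteLine($"n={n,10}: {sw.Elapsed.TotalMilliseconds:F1} ms for {lookups:N0} lookups");
        }
    }
}
```

I'd expect the three timings to land in the same ballpark on any machine, despite the 10,000x difference in set size.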
In general, it's O(1).
For a properly implemented hash table, lookups have amortized constant time complexity.
In practice, a single lookup can be O(n) in the case of collisions, as you say. However, if you perform a large number of lookups, the average time complexity per operation is constant (there's a sketch of the collision worst case after the quotation below).
Quoting Wikipedia:
Amortized analysis differs from average-case performance in that probability is not involved; amortized analysis guarantees the time per operation over worst-case performance.
The method requires knowledge of which series of operations are possible. This is most commonly the case with data structures, which have state that persists between operations. The basic idea is that a worst case operation can alter the state in such a way that the worst case cannot occur again for a long time, thus "amortizing" its cost.
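To make that collision worst case concrete, here's a sketch that forces every element into the same bucket with a deliberately bad comparer (ConstantHashComparer is just a name I made up for this example):

```csharp
using System;
using System.Collections.Generic;

// Deliberately pathological comparer: every value hashes to the same bucket.
class ConstantHashComparer : IEqualityComparer<int>
{
    public bool Equals(int x, int y) => x == y;
    public int GetHashCode(int obj) => 42; // constant hash => all collisions
}

class Program
{
    static void Main()
    {
        var bad = new HashSet<int>(new ConstantHashComparer());
        for (int i = 0; i < 10_000; i++)
            bad.Add(i); // every Add scans one ever-growing chain

        // This single lookup has to walk the entire 10,000-element chain
        // before it can report false: O(n) for one operation.
        Console.WriteLine(bad.Contains(-1));
    }
}
```

With the default comparer for int, the same lookup would touch only a handful of entries.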
No, Big O does not define "worst case"; it defines a limit. Hash-based lookups (with good hashing algorithms that provide efficient value distribution and a low collision rate) progress toward a constant cost as the number of items increases (they will never quite reach that constant value, but that's the point of it being a limit).
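For what it's worth, the standard textbook analysis of hashing with chaining (background I'm adding, not something from this thread) makes that constant explicit. With n items spread across m buckets by a uniformly distributing hash function, the expected number of entries examined per lookup is

$$\mathbb{E}[\text{entries examined}] \approx 1 + \alpha, \qquad \alpha = \frac{n}{m}$$

where α is the load factor. Because the table grows whenever α would exceed a fixed bound, α, and hence the expected lookup cost, stays bounded by a constant as n increases. A single unlucky lookup can still touch O(n) entries; it's the expected cost that stays constant.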