Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

When CPython set `in` operator is O(n)?

I was reading about the time complexity of set operations in CPython and learned that the in operator for sets has the average time complexity of O(1) and worst case time complexity of O(n). I also learned that the worst case wouldn't occur in CPython unless the set's hash table's load factor is too high.

This made me wonder, when such a case would occur in the CPython implementation? Is there a simple demo code, which shows a set with clearly observable O(n) time complexity of the in operator?

like image 818
ruohola Avatar asked Dec 07 '19 01:12

ruohola


1 Answers

Load factor is a red herring. In CPython sets (and dicts) automatically resize to keep the load factor under 2/3. There's nothing you can do in Python code to stop that.

O(N) behavior can occur when a great many elements have exactly the same hash code. Then they map to the same hash bucket, and set lookup degenerates to a slow form of linear search.

The easiest way to contrive such bad elements is to create a class with a horrible hash function. Like, e.g., and untested:

class C:
    def __init__(self, val):
        self.val = val
    def __eq__(a, b):
        return a.val == b.val
    def __hash__(self):
        return 3

Then hash(C(i)) == 3 regardless of the value of i.

To do the same with builtin types requires deep knowledge of their CPython implementation details. For example, here's a way to create an arbitrarily large number of distinct ints with the same hash code:

>>> import sys
>>> M = sys.hash_info.modulus
>>> set(hash(1 + i*M) for i in range(10000))
{1}

which shows that the ten thousand distinct ints created all have hash code 1.

like image 130
Tim Peters Avatar answered Oct 09 '22 20:10

Tim Peters