I was reading about the time complexity of set operations in CPython and learned that the in operator for sets has an average time complexity of O(1) and a worst-case time complexity of O(n). I also learned that the worst case wouldn't occur in CPython unless the set's hash table's load factor is too high.
This made me wonder: when would such a case occur in the CPython implementation? Is there a simple demo that shows a set with clearly observable O(n) time complexity for the in operator?
Load factor is a red herring. In CPython, sets (and dicts) automatically resize to keep the load factor under 2/3. There's nothing you can do in Python code to stop that.
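You can watch the resizing happen indirectly. Here's a small sketch (the exact growth thresholds are CPython implementation details and differ across versions); sys.getsizeof reflects the allocated hash table, and it jumps well before the table is completely full:

import sys

s = set()
last = sys.getsizeof(s)
print(f"len={len(s):3d}  bytes={last}")
for i in range(100):
    s.add(i)
    size = sys.getsizeof(s)
    if size != last:  # a resize just happened
        print(f"len={len(s):3d}  bytes={size}")
        last = size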
O(N) behavior can occur when a great many elements have exactly the same hash code. Then they map to the same hash bucket, and set lookup degenerates to a slow form of linear search.
The easiest way to contrive such bad elements is to create a class with a horrible hash function. Like, e.g., and untested:
class C:
    def __init__(self, val):
        self.val = val

    def __eq__(a, b):
        return a.val == b.val

    def __hash__(self):
        return 3  # every instance lands in the same hash bucket
Then hash(C(i)) == 3 regardless of the value of i.
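To make the O(n) behavior directly observable, here's a rough timing sketch built on that class. It times a membership test that misses, so every lookup has to walk the entire collision chain (exact numbers vary by machine; the sizes and repeat count here are arbitrary choices):

import time

class C:  # same class as above
    def __init__(self, val):
        self.val = val
    def __eq__(a, b):
        return a.val == b.val
    def __hash__(self):
        return 3

for n in (500, 1000, 2000, 4000):
    s = set(C(i) for i in range(n))  # building the set is itself quadratic
    probe = C(-1)                    # a miss: the lookup compares against all n elements
    t0 = time.perf_counter()
    for _ in range(100):             # average over repeated lookups
        probe in s
    per_lookup = (time.perf_counter() - t0) / 100
    print(f"n={n:5d}  per-lookup: {per_lookup * 1e6:9.1f} µs")

Doubling n should roughly double the per-lookup time, which is the clearly observable O(n) behavior the question asks for.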
To do the same with builtin types requires deep knowledge of their CPython implementation details. For example, here's a way to create an arbitrarily large number of distinct ints with the same hash code:
>>> import sys
>>> M = sys.hash_info.modulus
>>> set(hash(1 + i*M) for i in range(10000))
{1}
which shows that the ten thousand distinct ints created all have hash code 1.
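Plugging those colliding ints into a real set makes the slowdown observable against a control set of ordinary ints. Another rough sketch (machine-dependent timings; sizes and repeat counts are arbitrary):

import sys
import time

M = sys.hash_info.modulus

for n in (1000, 2000, 4000, 8000):
    colliding = set(1 + i * M for i in range(n))  # every element hashes to 1
    control = set(range(n))                       # well-spread hashes
    for label, s, probe in (("colliding", colliding, 1 + n * M),
                            ("control", control, n)):
        t0 = time.perf_counter()
        for _ in range(1000):  # probe is a miss in both sets
            probe in s
        per_lookup = (time.perf_counter() - t0) / 1000
        print(f"n={n:5d}  {label:9s}  per-lookup: {per_lookup * 1e9:8.0f} ns")

The colliding lookups should scale roughly linearly with n, while the control lookups stay flat.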