What is the time complexity of searching in Dict if very long strings are used as keys?

I read in the Python 3 documentation that Python uses a hash table for dict(), so the search time complexity should be O(1), with O(N) as the worst case. However, in a course I took recently, the teacher said that this holds only when you use ints as keys: if you use strings of length L as keys, the search time complexity is O(L).

I wrote a code snippet to test his claim:

import random
import string
from time import time
import matplotlib.pyplot as plt

def randomString(stringLength=10):
    """Generate a random lowercase string of the given length."""
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for _ in range(stringLength))

def test(L):
    """Return the total time to look up all N keys of length L in a dict."""
    N = 1000  # number of keys
    d = dict()
    for i in range(N):
        d[randomString(L)] = None

    # Time N dictionary lookups.
    tic = time()
    for key in d.keys():
        d[key]
    toc = time() - tic

    # Time the bare loop, to subtract the iteration overhead.
    tic = time()
    for key in d.keys():
        pass
    t_idle = time() - tic

    t_total = toc - t_idle
    return t_total

L = [i * 10000 for i in range(5, 15)]  # key lengths from 50,000 to 140,000
ans = [test(l) for l in L]

plt.figure()
plt.plot(L, ans)
plt.xlabel('key length L')
plt.ylabel('total lookup time (s)')
plt.show()

The result is very interesting. In the plot below, the x-axis is the length of the strings used as keys, and the y-axis is the total time to query all 1000 keys in the dictionary.

[plot: total query time vs. key length]

Can anyone explain this result?

Please be gentle with me. As you can see from this basic question, I don't have the ability to read the Python source code or equivalently complex internal documentation.

— River, asked Mar 03 '23

1 Answer

Since a dictionary is a hash table, and looking up a key in a hash table requires computing the key's hash, the time complexity of looking up a key in the dictionary cannot be less than the time complexity of the hash function.

In current versions of CPython, computing the hash of a string of length L takes O(L) time if it is the first time that particular string object has been hashed, and O(1) time if its hash has already been computed (since the hash is cached on the string object):

>>> from timeit import timeit
>>> s = 'b' * (10**9) # string of length 1 billion
>>> timeit(lambda: hash(s), number=1)
0.48574538500002973 # half a second
>>> timeit(lambda: hash(s), number=1)
5.301000044255488e-06 # 5 microseconds

So that's also how long it takes when you look up the key in a dictionary:

>>> s = 'c' * (10**9) # string of length 1 billion
>>> d = dict()
>>> timeit(lambda: s in d, number=1)
0.48521506899999167 # half a second
>>> timeit(lambda: s in d, number=1)
4.491000026973779e-06 # 5 microseconds

You also need to be aware that a key in a dictionary is not looked up by its hash alone: when the hashes match, the dictionary still needs to test that the key you looked up is equal to the key stored in the dictionary, in case the hash match is a false positive. Testing equality of strings takes O(L) time in the worst case:

>>> s1 = 'a'*(10**9)
>>> s2 = 'a'*(10**9)
>>> timeit(lambda: s1 == s2, number=1)
0.2006020820001595

So for a key of length L and a dictionary of n keys:

  • If the key is not present in the dictionary, and its hash has already been cached, then it takes O(1) average time to confirm it is absent.
  • If the key is not present and its hash has not been cached, then it takes O(L) average time, because computing the hash dominates.
  • If the key is present, it takes O(L) average time to confirm it is present, whether or not the hash needs to be computed, because of the equality test (see the first sketch below).
  • The worst case is always O(nL), because if every hash collides and the strings are all equal except in their final characters, then a slow equality test has to be done n times (see the second sketch below).
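
To see the third case in isolation, here is a minimal sketch: both hashes are pre-computed and cached before the lookup, so the time measured is essentially the O(L) equality comparison alone.

from timeit import timeit

s1 = 'a' * (10**8)   # string of length 100 million
s2 = 'a' * (10**8)   # equal to s1, but a distinct object
d = {s1: None}       # inserting s1 computes and caches hash(s1)
hash(s2)             # pre-compute and cache hash(s2) as well

# The lookup finds a matching hash, but must still compare the two
# strings character by character to rule out a false positive.
print(timeit(lambda: s2 in d, number=1))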
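
And a sketch of the O(nL) worst case, using a hypothetical BadKey wrapper class whose hash always collides; real str keys would not normally behave like this, since CPython randomizes string hashes:

class BadKey:
    """Hypothetical key type: every instance hashes to the same bucket."""
    def __init__(self, s):
        self.s = s
    def __hash__(self):
        return 0                  # every key collides
    def __eq__(self, other):
        return self.s == other.s  # O(L) string comparison

n, L = 100, 10**6
# All keys share a long common prefix and differ only in the last 4 characters.
d = {BadKey('x' * L + format(i, '04d')): None for i in range(n)}

# The probe collides with every stored key, and each equality test scans
# the whole common prefix before finding a difference: O(n*L) in total.
probe = BadKey('x' * L + '9999')
print(probe in d)   # False

Timing that final lookup for growing n and L would show the running time scaling with the product nL.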

— kaya3, answered Mar 05 '23