How come key in dictionary.keys() is slower than key in dictionary?

While reading Python 101 by Michael Driscoll, I got to his explanation of checking whether a key exists in a dictionary. I tested it on my machine with a dictionary mapping the keys 'a' to 'z' to their positions, and a function that measures the time to look up the key 't' (picked at random).

Here's my function:

import time

def f(d, flag):
    start_time = time.time()
    if flag:
        print("t" in d)
    else:
        print("t" in d.keys())
    print("--- %s seconds ---" % (time.time() - start_time))

And here are the results:

>>> f(dict,True)
True
--- 0.03937530517578125 seconds ---

>>> f(dict,False)
True
--- 0.05114388465881348 seconds ---

BUT, I still don't get it. I thought that key in dict.keys() would result in iterating over a much smaller collection, which would be faster. Is there something special in the implementation of in or keys() that causes this?

CIsForCookies asked May 29 '17 13:05

1 Answer

Using dictionary.keys() is slower because it does more work:

  • It adds an attribute lookup: dictionary.keys
  • It adds a method call (keys()), requiring the current call frame to be pushed onto the stack, and popped afterwards.
  • Another object has to be created for the return value (a dictionary view). It's a light-weight object, but you still need to allocate memory for it on the heap.
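You can see the first two sources of overhead directly in the bytecode. As a sketch, the standard dis module compares the compiled forms of the two expressions; the exact opcode names vary between CPython versions, but the pattern is the same: the .keys() version needs an attribute load and a call, while the plain test compiles to a single containment check.

```python
import dis

# Compile both expressions and collect their instructions.
plain = list(dis.Bytecode(compile("'t' in d", "<demo>", "eval")))
via_keys = list(dis.Bytecode(compile("'t' in d.keys()", "<demo>", "eval")))

print("'t' in d:")
for ins in plain:
    print(" ", ins.opname)

print("'t' in d.keys():")
for ins in via_keys:
    print(" ", ins.opname)
```

The plain expression contains no call instruction at all; the .keys() variant has both the attribute/method load and the call, which is exactly the extra work itemised above.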

None of this is needed, because a containment test against the dictionary and one against the keys view test the exact same thing. Containment testing directly on a dictionary does not include the values; in both cases you are testing for keys only.
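To illustrate that point with a toy dictionary: membership on a dict and on its keys view give identical answers, and values are never searched unless you ask for them explicitly.

```python
d = {'a': 0, 'b': 1}

print('a' in d)         # True  -- 'a' is a key
print('a' in d.keys())  # True  -- same test, same result
print(0 in d)           # False -- values are not searched
print(0 in d.values())  # True  -- an explicit values() view is needed for that
```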

From the dict() object documentation:

key in d
Return True if d has a key key, else False.

Note that using wall-clock time is not a great way of testing for performance differences. Use the timeit module instead, which picks the best performing timer, disables the GC to eliminate a source of skew, and repeats the test many times to minimise system skew.
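A sketch of that advice, using timeit.repeat and taking the minimum of several runs; the iteration count and the dictionary here are assumptions chosen to mirror the question, not a canonical benchmark:

```python
import timeit
from string import ascii_lowercase

d = {l: i for i, l in enumerate(ascii_lowercase)}

# repeat() runs each 100k-iteration measurement 5 times; min() is the
# most representative figure, since anything above it is system noise.
direct = min(timeit.repeat("'t' in d", globals={'d': d}, number=100_000))
via_keys = min(timeit.repeat("'t' in d.keys()", globals={'d': d}, number=100_000))

print(f"'t' in d:        {direct:.4f}s")
print(f"'t' in d.keys(): {via_keys:.4f}s")
```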

You can reproduce the time difference by testing for the additional steps above separately (combining the call and object creation into one). By default timeit.timeit() repeats the test 1 million times and returns the total time taken in seconds:

>>> import timeit
>>> from string import ascii_lowercase
>>> d = {l: i for i, l in enumerate(ascii_lowercase)}
>>> 't' in d
True
>>> timeit.timeit('d.keys', globals={'d': d})
0.0439452639548108
>>> timeit.timeit('keys()', globals={'keys': d.keys})
0.06267352704890072

So merely looking up the .keys attribute 1 million times already takes 44 milliseconds, while calling the already-bound method (without the attribute lookup) takes 63ms. Both measurements also include the overhead of looking up the global name, however:

>>> timeit.timeit('d', globals={'d': d})
0.027833244064822793

So one would expect a difference of roughly 107 - 28 == 79ms between the two methods.

And indeed, the time difference between using 't' in d and 't' in d.keys() is about that much:

>>> timeit.timeit('"t" in d.keys()', globals={'d': d})
0.11647015693597496
>>> timeit.timeit('"t" in d', globals={'d': d})
0.0370339349610731

116 - 37 is 79 milliseconds, as predicted.

Martijn Pieters answered Oct 16 '22 20:10