Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python 2.7 writing "x in set" vs "set.__contains__(x)" [duplicate]

I've run the following small test in python 2.7.6:

s = set(xrange(0, 1000000))
for i in xrange(0, 5000000):
    if s.__contains__(i):
        pass

and got the following output for running time python py.py:

real    0m0.616s

then I changed my code to:

s = set(xrange(0, 1000000))
for i in xrange(0, 5000000):
    if i in s:
        pass

and got run time of 0.467s. I also got same results for dict. My question is "why there is performance difference?", maybe some explanation of how python performs calls of s.__contains__(i) and i in s

like image 942
deathnik Avatar asked Dec 25 '22 04:12

deathnik


1 Answers

When using s.__contains__(i) you need to do the attribute lookup in Python, as well as make a method call in Python. These are executed as separate bytecodes:

>>> import dis
>>> dis.dis(compile('s.__contains__(i)', '<>', 'exec'))
  1           0 LOAD_NAME                0 (s)
              3 LOAD_ATTR                1 (__contains__)
              6 LOAD_NAME                2 (i)
              9 CALL_FUNCTION            1
             12 POP_TOP             
             13 LOAD_CONST               0 (None)
             16 RETURN_VALUE        

LOAD_ATTR loads the __contains__ attribute, and CALL_FUNCTION then executes the method.

Using i in s takes just one bytecode:

>>> dis.dis(compile('i in s', '<>', 'exec'))
  1           0 LOAD_NAME                0 (i)
              3 LOAD_NAME                1 (s)
              6 COMPARE_OP               6 (in)
              9 POP_TOP             
             10 LOAD_CONST               0 (None)
             13 RETURN_VALUE        

Here COMPARE_OP does all the work.

In C, then, looking up the __contains__ slot (or rather, its C equivalent) and calling it is much faster.

Note that when comparing two approaches in Python, it is much better to use the timeit module rather than the UNIX time command; it'll try to eliminate environmental factors and repeat the test for you:

>>> import timeit, random
>>> testset = {random.randrange(50000) for _ in xrange(1000)}
>>> tests = [random.randrange(5000) for _ in xrange(500)]
>>> timeit.timeit('for i in tests: i in s',
...               'from __main__ import testset as s, tests',
...               number=100000)
2.5375568866729736
>>> timeit.timeit('for i in tests: s.__contains__(i)',
...               'from __main__ import testset as s, tests',
...               number=100000)
4.208703994750977

Using __contains__ is slower by some distance.

like image 188
Martijn Pieters Avatar answered Dec 31 '22 13:12

Martijn Pieters