Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Surprising results with Python timeit: Counter() vs defaultdict() vs dict()

I obtained very surprising results with timeit, can someone tell me if I am doing something wrong ? I am using Python 2.7.

This is the contents of file speedtest_init.py:

import random

to_count = [random.randint(0, 100) for r in range(60)]

These are the contents of speedtest.py:

__author__ = 'BlueTrin'

import timeit

def test_init1():
    print(timeit.timeit('import speedtest_init'))

def test_counter1():
    s = """\
    d = defaultdict(int);
    for i in speedtest_init.to_count:
        d[i] += 1
    """
    print(timeit.timeit(s, 'from collections import defaultdict; import speedtest_init;'))

def test_counter2():
    print(timeit.timeit('d = Counter(speedtest_init.to_count);', 'from collections import Counter; import speedtest_init;'))


if __name__ == "__main__":
    test_init1()
    test_counter1()
    test_counter2()

The console output is:

C:\Python27\python.exe C:/Dev/codility/chlorum2014/speedtest.py
2.71501962931
65.7090444503
91.2953839048

Process finished with exit code 0

I think by default timeit() runs 1000000 times the code, so I need to divide the times by 1000000, but what is surprising is that the Counter is slower than the defaultdict().

Is that expected ?

EDIT:

Also using a dict is faster than a defaultdict(int):

def test_counter3():
    s = """\
    d = {};
    for i in speedtest_init.to_count:
        if i not in d:
            d[i] = 1
        else:
            d[i] += 1
    """
    print(timeit.timeit(stmt=s, setup='from collections import defaultdict; import speedtest_init;')

this last version is faster than the defaultdict(int) meaning that unless you care more about readability you should use the dict() rather than the defaultdict().

like image 677
BlueTrin Avatar asked Jan 06 '15 15:01

BlueTrin


People also ask

What is the use of defaultdict in Python?

word_count_dict = {} We could use defaultdict to reduce the number of lines in the code. We could also have used Counter to do this, which according to me is the most preferable method for this problem. If we use Counter, we can also get the most common words using a simple function. Other use cases of Counter: So, why ever use defaultdict ?

Is the DICT () faster than the defaultdict (int)?

this last version is faster than the defaultdict (int) meaning that unless you care more about readability you should use the dict () rather than the defaultdict (). Show activity on this post. Yes, this is expected; the Counter () constructor uses Counter.update () which uses self.get () to load initial values rather than rely on __missing__.

Why use defaultdict instead of counter?

We could use defaultdict to reduce the number of lines in the code. We could also have used Counter to do this, which according to me is the most preferable method for this problem. If we use Counter, we can also get the most common words using a simple function. Other use cases of Counter: So, why ever use defaultdict ?

What is the difference between counter () and defaultdict (INT) in MySQL?

Both Counterand defaultdict(int)can work fine here, but there are few differences between them: Countersupports most of the operations you can do on a multiset. So, if you want to use those operation then go for Counter. Counterwon't add new keys to the dict when you query for missing keys.


1 Answers

Yes, this is expected; the Counter() constructor uses Counter.update() which uses self.get() to load initial values rather than rely on __missing__.

Moreover, the defaultdict __missing__ factory is handled entirely in C code, especially when using another type like int() that is itself implemented in C. The Counter source is pure Python and as such the Counter.__missing__ method requires a Python frame to execute.

Because dict.get() is still handled in C, the constructor approach is the faster approach for a Counter(), provided you use the same trick Counter.update() uses and alias the self.get lookup as a local first:

>>> import timeit
>>> import random
>>> import sys
>>> sys.version_info
sys.version_info(major=2, minor=7, micro=9, releaselevel='final', serial=0)
>>> to_count = [random.randint(0, 100) for r in range(60)]
>>> timeit.timeit('for i in to_count: c[i] += 1',
...               'from collections import Counter; from __main__ import to_count; c = Counter()',
...               number=10000)
0.2510359287261963
>>> timeit.timeit('for i in to_count: c[i] = c_get(i, 0) + 1',
...               'from collections import Counter; from __main__ import to_count; c = Counter(); c_get = c.get',
...               number=10000)
0.20978617668151855

Both defaultdict and Counter are helpful classes built for their functionality, not their performance; not relying on the __missing__ hook can be faster still:

>>> timeit.timeit('for i in to_count: d[i] = d_get(i, 0) + 1',
...               'from __main__ import to_count; d = {}; d_get = d.get',
...               number=10000)
0.11437392234802246

That's a regular dictionary using an aliased dict.get() method for maximum speed. But then you'll also have to re-implement the bag behaviour of Counter, or the Counter.most_common() method. The defaultdict use cases go way beyond counting.

In Python 3.2, updating a Counter() got a speed boost by adding a C library that handles this case; see issue 10667. Testing on Python 3.4, the Counter() constructor now beats the aliased dict.get case:

>>> timeit.timeit('Counter(to_count)',
...               'from collections import Counter; from __main__ import to_count',
...               number=100000)
0.8332311600097455
>>> timeit.timeit('for i in to_count: d[i] = d_get(i, 0) + 1',
...               'from __main__ import to_count; d = {}; d_get = d.get',
...               number=100000)
0.961191965994658
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=4, micro=2, releaselevel='final', serial=0)

(Note: to get a meaningful timing result the number of iterations was increased from 10k to 100k; so if you compare these against the dict.get() case above you need to take the timing there times ten, at 1.144 seconds).

like image 98
Martijn Pieters Avatar answered Sep 24 '22 17:09

Martijn Pieters