Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is collections.Counter much slower than ''.count?

I have a simple task: To count how many times every letter occurs in a string. I've used a Counter() for it, but on one forum I saw information that using dict() / Counter() is much slower than using string.count() for every letter. I thought that it would interate through the string only once, and the string.count() solution would have to iterate through it four times (in this case). Why is Counter() so slow?

>>> timeit.timeit('x.count("A");x.count("G");x.count("C");x.count("T")', setup="x='GAAAAAGTCGTAGGGTTCCTTCACTCGAGGAATGCTGCGACAGTAAAGGAGGCCACGTGGTTGAGAGTTCCTAAGCATTCGTATGTACACCCGGACTCGATGCACTCAAACGTGCTTAAGGGTAAAGAAGGTCGAGAGGTATACTGGGGCACTCCCCTTAGAATTATATCTTGGTCAACTACAATATGGATGGAAATTCTAAGCCGAAAACGACCCGCTAGCGGATTGTGTATGTATCACAACGGTTTCGGTTCATACGCAAAATCATCCCATTTCAAGGCCACTCAAGGACATGACGCCGTGCAACTCCGAGGACATCCCTCAGCGATTGATGCAACCTGGTCATCTAATAATCCTTAGAACGGATGTGCCCTCTACTGGGAGAGCCGGCTAGACTGGCATCTCGCGTTGTTCGTACGAGCTCCGGGCGCCCGGGCGGTGTACGTTGATGTACAGCCTAAGAGCTTTCCACCTATGCTACGAACTAATTTCCCGTCCATCGTTCCTCGGACTGAGGTCAAAGTAACCCGGAAGTACATGGATCAGATACACTCACAGTCCCCTTTAATGACTGAGCTGGACGCTATTGATTGCTTTATAAGTGTTATGGTGAACTCGAAGACTTAGCTAGGAATTTCGCTATACCCGGGTAATGAGCTTAATACCTCACAGCATGTACGCTCTGAATATATGTAGCGATGCTAGCGGAACGTAAGCGTGAGCGTTATGCAGGGCTCCGCACCTCGTGGCCACTCGCCCAATGCCCGAGTTTTTGAGCAATGCCATGCCCTCCAGGTGAAGCGTGCTGAATATGTTCCGCCTCCGCACACCTACCCTACGGGCCTTACGCCATAGCTGAGGATACGCGAGTTGGTTAGCGATTACGTCATTCCAGGTGGTCGTTC'", number=10000)
0.07911698750407936
>>> timeit.timeit('Counter(x)', setup="from collections import Counter;x='GAAAAAGTCGTAGGGTTCCTTCACTCGAGGAATGCTGCGACAGTAAAGGAGGCCACGTGGTTGAGAGTTCCTAAGCATTCGTATGTACACCCGGACTCGATGCACTCAAACGTGCTTAAGGGTAAAGAAGGTCGAGAGGTATACTGGGGCACTCCCCTTAGAATTATATCTTGGTCAACTACAATATGGATGGAAATTCTAAGCCGAAAACGACCCGCTAGCGGATTGTGTATGTATCACAACGGTTTCGGTTCATACGCAAAATCATCCCATTTCAAGGCCACTCAAGGACATGACGCCGTGCAACTCCGAGGACATCCCTCAGCGATTGATGCAACCTGGTCATCTAATAATCCTTAGAACGGATGTGCCCTCTACTGGGAGAGCCGGCTAGACTGGCATCTCGCGTTGTTCGTACGAGCTCCGGGCGCCCGGGCGGTGTACGTTGATGTACAGCCTAAGAGCTTTCCACCTATGCTACGAACTAATTTCCCGTCCATCGTTCCTCGGACTGAGGTCAAAGTAACCCGGAAGTACATGGATCAGATACACTCACAGTCCCCTTTAATGACTGAGCTGGACGCTATTGATTGCTTTATAAGTGTTATGGTGAACTCGAAGACTTAGCTAGGAATTTCGCTATACCCGGGTAATGAGCTTAATACCTCACAGCATGTACGCTCTGAATATATGTAGCGATGCTAGCGGAACGTAAGCGTGAGCGTTATGCAGGGCTCCGCACCTCGTGGCCACTCGCCCAATGCCCGAGTTTTTGAGCAATGCCATGCCCTCCAGGTGAAGCGTGCTGAATATGTTCCGCCTCCGCACACCTACCCTACGGGCCTTACGCCATAGCTGAGGATACGCGAGTTGGTTAGCGATTACGTCATTCCAGGTGGTCGTTC'", number=10000)
2.1727447831030844
>>> 2.1727447831030844 / 0.07911698750407936
27.462430656767047
>>> 
like image 564
SathOkh Avatar asked Jan 06 '13 20:01

SathOkh


People also ask

Are collections counters fast?

Counter is slow, it's actually quite fast, but it's a general purpose tool, counting characters is just one of many applications. On the other hand str. count just counts characters in strings and it's heavily optimized for its one and only task.

Is counter faster than count?

Counter is faster in theory, but has higher fixed overhead, especially compared to str. count , which can scan the underlying C array with direct memory comparisons, where list.

What is the use of collections counter?

The Counter holds the data in an unordered collection, just like hashtable objects. The elements here represent the keys and the count as values. It allows you to count the items in an iterable list. Arithmetic operations like addition, subtraction, intersection, and union can be easily performed on a Counter.

Is Counter efficient Python?

Counting several repeated objects at once is a common problem in programming. Python offers a bunch of tools and techniques you can use to approach this problem. However, Python's Counter from collections provides a clean, efficient, and Pythonic solution.


2 Answers

Counter() allows you to count any hashable objects, not just substrings. Both solutions are O(n)-time. Your measurements show that the overhead of iterating and hashing individual characters by Counter() is greater than running s.count() 4 times.

Counter() can use C helper to count elements but it seems it doesn't special case strings and uses general algorithm applicable for any other iterable i.e., processing a single character involves multiple Python C API calls to advance the iterator, get previous value (a lookup in the hash table), increment counter, set new value (a lookup in the hash table):

    while (1) {
        key = PyIter_Next(it);
        if (key == NULL)
            break;
        oldval = PyObject_GetItem(mapping, key);
        if (oldval == NULL) {
            if (!PyErr_Occurred() || !PyErr_ExceptionMatches(PyExc_KeyError))
                break;
            PyErr_Clear();
            Py_INCREF(one);
            newval = one;
        } else {
            newval = PyNumber_Add(oldval, one);
            Py_DECREF(oldval);
            if (newval == NULL)
                break;
        }
        if (PyObject_SetItem(mapping, key, newval) == -1)
            break;
        Py_CLEAR(newval);
        Py_DECREF(key);
    }

Compare it to FASTSEARCH() overhead for bytestrings:

    for (i = 0; i < n; i++)
        if (s[i] == p[0]) {
           count++;
           if (count == maxcount)
              return maxcount;
        }
    return count;
like image 161
jfs Avatar answered Sep 22 '22 17:09

jfs


The Counter class inherits from dict, while string.count is the following C-implementation (CPython 3.3):

/* stringlib: count implementation */

#ifndef STRINGLIB_FASTSEARCH_H
#error must include "stringlib/fastsearch.h" before including this module
#endif


Py_LOCAL_INLINE(Py_ssize_t)
STRINGLIB(count)(const STRINGLIB_CHAR* str, Py_ssize_t str_len,
                const STRINGLIB_CHAR* sub, Py_ssize_t sub_len,
                Py_ssize_t maxcount)
{
    Py_ssize_t count;

    if (str_len < 0)
        return 0; /* start > len(str) */
    if (sub_len == 0)
        return (str_len < maxcount) ? str_len + 1 : maxcount;

    count = FASTSEARCH(str, str_len, sub, sub_len, maxcount, FAST_COUNT);

    if (count < 0)
        return 0; /* no match */

    return count;
}

Guess, which one is faster? :)

like image 27
Zaur Nasibov Avatar answered Sep 19 '22 17:09

Zaur Nasibov