I have a simple task: to count how many times every letter occurs in a string. I've used a Counter() for it, but on one forum I saw information that using dict() / Counter() is much slower than using string.count() for every letter. I thought that Counter() would iterate through the string only once, while the string.count() solution would have to iterate through it four times (in this case). Why is Counter() so slow?
>>> timeit.timeit('x.count("A");x.count("G");x.count("C");x.count("T")', setup="x='GAAAAAGTCGTAGGGTTCCTTCACTCGAGGAATGCTGCGACAGTAAAGGAGGCCACGTGGTTGAGAGTTCCTAAGCATTCGTATGTACACCCGGACTCGATGCACTCAAACGTGCTTAAGGGTAAAGAAGGTCGAGAGGTATACTGGGGCACTCCCCTTAGAATTATATCTTGGTCAACTACAATATGGATGGAAATTCTAAGCCGAAAACGACCCGCTAGCGGATTGTGTATGTATCACAACGGTTTCGGTTCATACGCAAAATCATCCCATTTCAAGGCCACTCAAGGACATGACGCCGTGCAACTCCGAGGACATCCCTCAGCGATTGATGCAACCTGGTCATCTAATAATCCTTAGAACGGATGTGCCCTCTACTGGGAGAGCCGGCTAGACTGGCATCTCGCGTTGTTCGTACGAGCTCCGGGCGCCCGGGCGGTGTACGTTGATGTACAGCCTAAGAGCTTTCCACCTATGCTACGAACTAATTTCCCGTCCATCGTTCCTCGGACTGAGGTCAAAGTAACCCGGAAGTACATGGATCAGATACACTCACAGTCCCCTTTAATGACTGAGCTGGACGCTATTGATTGCTTTATAAGTGTTATGGTGAACTCGAAGACTTAGCTAGGAATTTCGCTATACCCGGGTAATGAGCTTAATACCTCACAGCATGTACGCTCTGAATATATGTAGCGATGCTAGCGGAACGTAAGCGTGAGCGTTATGCAGGGCTCCGCACCTCGTGGCCACTCGCCCAATGCCCGAGTTTTTGAGCAATGCCATGCCCTCCAGGTGAAGCGTGCTGAATATGTTCCGCCTCCGCACACCTACCCTACGGGCCTTACGCCATAGCTGAGGATACGCGAGTTGGTTAGCGATTACGTCATTCCAGGTGGTCGTTC'", number=10000)
0.07911698750407936
>>> timeit.timeit('Counter(x)', setup="from collections import Counter;x='GAAAAAGTCGTAGGGTTCCTTCACTCGAGGAATGCTGCGACAGTAAAGGAGGCCACGTGGTTGAGAGTTCCTAAGCATTCGTATGTACACCCGGACTCGATGCACTCAAACGTGCTTAAGGGTAAAGAAGGTCGAGAGGTATACTGGGGCACTCCCCTTAGAATTATATCTTGGTCAACTACAATATGGATGGAAATTCTAAGCCGAAAACGACCCGCTAGCGGATTGTGTATGTATCACAACGGTTTCGGTTCATACGCAAAATCATCCCATTTCAAGGCCACTCAAGGACATGACGCCGTGCAACTCCGAGGACATCCCTCAGCGATTGATGCAACCTGGTCATCTAATAATCCTTAGAACGGATGTGCCCTCTACTGGGAGAGCCGGCTAGACTGGCATCTCGCGTTGTTCGTACGAGCTCCGGGCGCCCGGGCGGTGTACGTTGATGTACAGCCTAAGAGCTTTCCACCTATGCTACGAACTAATTTCCCGTCCATCGTTCCTCGGACTGAGGTCAAAGTAACCCGGAAGTACATGGATCAGATACACTCACAGTCCCCTTTAATGACTGAGCTGGACGCTATTGATTGCTTTATAAGTGTTATGGTGAACTCGAAGACTTAGCTAGGAATTTCGCTATACCCGGGTAATGAGCTTAATACCTCACAGCATGTACGCTCTGAATATATGTAGCGATGCTAGCGGAACGTAAGCGTGAGCGTTATGCAGGGCTCCGCACCTCGTGGCCACTCGCCCAATGCCCGAGTTTTTGAGCAATGCCATGCCCTCCAGGTGAAGCGTGCTGAATATGTTCCGCCTCCGCACACCTACCCTACGGGCCTTACGCCATAGCTGAGGATACGCGAGTTGGTTAGCGATTACGTCATTCCAGGTGGTCGTTC'", number=10000)
2.1727447831030844
>>> 2.1727447831030844 / 0.07911698750407936
27.462430656767047
>>>
It's not that Counter is slow; it's actually quite fast. But it's a general-purpose tool, and counting characters is just one of many applications. str.count, on the other hand, does nothing but count substrings in strings, and it's heavily optimized for its one and only task.
Counter should be faster in theory (one pass instead of four), but it has a much higher fixed overhead per element, especially compared to str.count, which can scan the underlying C array with direct memory comparisons.
Counter holds its data in an unordered mapping, like a hash table: the elements are the keys and their counts are the values. It lets you count the items in any iterable, and arithmetic operations like addition, subtraction, intersection, and union can be performed directly on Counter objects.
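For example, a short sketch of that Counter arithmetic (the strings below are made up purely for illustration):

from collections import Counter

a = Counter("GATTACA")   # per-character counts: A:3, T:2, G:1, C:1
b = Counter("TACT")

print(a + b)   # addition: counts are summed
print(a - b)   # subtraction: zero and negative counts are dropped
print(a & b)   # intersection: minimum of each count
print(a | b)   # union: maximum of each count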
Counting several repeated objects at once is a common problem in programming. Python offers a bunch of tools and techniques you can use to approach this problem. However, Python's Counter from collections provides a clean, efficient, and Pythonic solution.
Counter() allows you to count any hashable objects, not just substrings. Both solutions are O(n) in time. Your measurements show that the overhead of iterating over and hashing individual characters in Counter() is greater than running s.count() four times.
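To make the comparison concrete, here is a minimal sketch of the two approaches side by side (the sample string and variable names are just for illustration):

from collections import Counter

x = "GATTACAGATTACA"  # any DNA-like string

# One pass over the string, but every character is hashed and a dict
# entry is updated through Python's general-purpose machinery.
counts_counter = Counter(x)

# Four passes over the string, but each pass is a tight C loop over
# the underlying character buffer.
counts_str = {base: x.count(base) for base in "ACGT"}

print(counts_counter)
print(counts_str)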
Counter() can use a C helper to count elements, but it seems it doesn't special-case strings and instead uses the general algorithm that works for any iterable, i.e., processing a single character involves multiple Python C API calls: advance the iterator, get the previous value (a lookup in the hash table), increment the counter, and set the new value (another lookup in the hash table):
while (1) {
    key = PyIter_Next(it);
    if (key == NULL)
        break;
    oldval = PyObject_GetItem(mapping, key);
    if (oldval == NULL) {
        if (!PyErr_Occurred() || !PyErr_ExceptionMatches(PyExc_KeyError))
            break;
        PyErr_Clear();
        Py_INCREF(one);
        newval = one;
    } else {
        newval = PyNumber_Add(oldval, one);
        Py_DECREF(oldval);
        if (newval == NULL)
            break;
    }
    if (PyObject_SetItem(mapping, key, newval) == -1)
        break;
    Py_CLEAR(newval);
    Py_DECREF(key);
}
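In rough pure-Python terms, that loop does something like the following for every single element; this is only a sketch of the general iterator path, not the actual implementation:

def count_elements(mapping, iterable):
    # Per element: advance the iterator, one hash lookup to read the old
    # count, one addition, one hash lookup to store the new count.
    for key in iterable:
        mapping[key] = mapping.get(key, 0) + 1
    return mapping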
Compare it to the FASTSEARCH() overhead for bytestrings:
for (i = 0; i < n; i++)
    if (s[i] == p[0]) {
        count++;
        if (count == maxcount)
            return maxcount;
    }
return count;
The Counter class inherits from dict, while string.count uses the following C implementation (CPython 3.3):
/* stringlib: count implementation */

#ifndef STRINGLIB_FASTSEARCH_H
#error must include "stringlib/fastsearch.h" before including this module
#endif

Py_LOCAL_INLINE(Py_ssize_t)
STRINGLIB(count)(const STRINGLIB_CHAR* str, Py_ssize_t str_len,
                 const STRINGLIB_CHAR* sub, Py_ssize_t sub_len,
                 Py_ssize_t maxcount)
{
    Py_ssize_t count;

    if (str_len < 0)
        return 0; /* start > len(str) */
    if (sub_len == 0)
        return (str_len < maxcount) ? str_len + 1 : maxcount;

    count = FASTSEARCH(str, str_len, sub, sub_len, maxcount, FAST_COUNT);

    if (count < 0)
        return 0; /* no match */
    return count;
}
Guess which one is faster? :)