I have a simple task: to count how many times every letter occurs in a string. I've used a Counter() for it, but on one forum I saw information that using dict() / Counter() is much slower than using string.count() for every letter. I thought that Counter() would iterate through the string only once, while the string.count() solution would have to iterate through it four times (in this case). Why is Counter() so slow?
>>> timeit.timeit('x.count("A");x.count("G");x.count("C");x.count("T")', setup="x='GAAAAAGTCGTAGGGTTCCTTCACTCGAGGAATGCTGCGACAGTAAAGGAGGCCACGTGGTTGAGAGTTCCTAAGCATTCGTATGTACACCCGGACTCGATGCACTCAAACGTGCTTAAGGGTAAAGAAGGTCGAGAGGTATACTGGGGCACTCCCCTTAGAATTATATCTTGGTCAACTACAATATGGATGGAAATTCTAAGCCGAAAACGACCCGCTAGCGGATTGTGTATGTATCACAACGGTTTCGGTTCATACGCAAAATCATCCCATTTCAAGGCCACTCAAGGACATGACGCCGTGCAACTCCGAGGACATCCCTCAGCGATTGATGCAACCTGGTCATCTAATAATCCTTAGAACGGATGTGCCCTCTACTGGGAGAGCCGGCTAGACTGGCATCTCGCGTTGTTCGTACGAGCTCCGGGCGCCCGGGCGGTGTACGTTGATGTACAGCCTAAGAGCTTTCCACCTATGCTACGAACTAATTTCCCGTCCATCGTTCCTCGGACTGAGGTCAAAGTAACCCGGAAGTACATGGATCAGATACACTCACAGTCCCCTTTAATGACTGAGCTGGACGCTATTGATTGCTTTATAAGTGTTATGGTGAACTCGAAGACTTAGCTAGGAATTTCGCTATACCCGGGTAATGAGCTTAATACCTCACAGCATGTACGCTCTGAATATATGTAGCGATGCTAGCGGAACGTAAGCGTGAGCGTTATGCAGGGCTCCGCACCTCGTGGCCACTCGCCCAATGCCCGAGTTTTTGAGCAATGCCATGCCCTCCAGGTGAAGCGTGCTGAATATGTTCCGCCTCCGCACACCTACCCTACGGGCCTTACGCCATAGCTGAGGATACGCGAGTTGGTTAGCGATTACGTCATTCCAGGTGGTCGTTC'", number=10000)
0.07911698750407936
>>> timeit.timeit('Counter(x)', setup="from collections import Counter;x='GAAAAAGTCGTAGGGTTCCTTCACTCGAGGAATGCTGCGACAGTAAAGGAGGCCACGTGGTTGAGAGTTCCTAAGCATTCGTATGTACACCCGGACTCGATGCACTCAAACGTGCTTAAGGGTAAAGAAGGTCGAGAGGTATACTGGGGCACTCCCCTTAGAATTATATCTTGGTCAACTACAATATGGATGGAAATTCTAAGCCGAAAACGACCCGCTAGCGGATTGTGTATGTATCACAACGGTTTCGGTTCATACGCAAAATCATCCCATTTCAAGGCCACTCAAGGACATGACGCCGTGCAACTCCGAGGACATCCCTCAGCGATTGATGCAACCTGGTCATCTAATAATCCTTAGAACGGATGTGCCCTCTACTGGGAGAGCCGGCTAGACTGGCATCTCGCGTTGTTCGTACGAGCTCCGGGCGCCCGGGCGGTGTACGTTGATGTACAGCCTAAGAGCTTTCCACCTATGCTACGAACTAATTTCCCGTCCATCGTTCCTCGGACTGAGGTCAAAGTAACCCGGAAGTACATGGATCAGATACACTCACAGTCCCCTTTAATGACTGAGCTGGACGCTATTGATTGCTTTATAAGTGTTATGGTGAACTCGAAGACTTAGCTAGGAATTTCGCTATACCCGGGTAATGAGCTTAATACCTCACAGCATGTACGCTCTGAATATATGTAGCGATGCTAGCGGAACGTAAGCGTGAGCGTTATGCAGGGCTCCGCACCTCGTGGCCACTCGCCCAATGCCCGAGTTTTTGAGCAATGCCATGCCCTCCAGGTGAAGCGTGCTGAATATGTTCCGCCTCCGCACACCTACCCTACGGGCCTTACGCCATAGCTGAGGATACGCGAGTTGGTTAGCGATTACGTCATTCCAGGTGGTCGTTC'", number=10000)
2.1727447831030844
>>> 2.1727447831030844 / 0.07911698750407936
27.462430656767047
>>>
It's not that Counter is slow; it's actually quite fast. But it's a general-purpose tool, and counting characters is just one of many applications. str.count, on the other hand, does nothing but count substrings in strings, and it's heavily optimized for its one and only task.
Counter should be faster in theory (one pass instead of four), but it has a much higher fixed overhead per element, especially compared to str.count, which can scan the underlying C array with direct memory comparisons.
Counter holds its data in an unordered mapping, like a hash table: the elements are the keys and their counts are the values. It lets you count the items in any iterable, and arithmetic operations like addition, subtraction, intersection, and union can be performed directly on Counter objects.
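For example, a short sketch of that Counter arithmetic (the strings below are made up purely for illustration):

from collections import Counter

a = Counter("GATTACA")   # per-character counts: A:3, T:2, G:1, C:1
b = Counter("TACT")

print(a + b)   # addition: counts are summed
print(a - b)   # subtraction: zero and negative counts are dropped
print(a & b)   # intersection: minimum of each count
print(a | b)   # union: maximum of each count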
Counting several repeated objects at once is a common problem in programming. Python offers a bunch of tools and techniques you can use to approach this problem. However, Python's Counter from collections provides a clean, efficient, and Pythonic solution.
Counter() allows you to count any hashable objects, not just substrings. Both solutions are O(n) in time. Your measurements show that the overhead of iterating over and hashing individual characters in Counter() is greater than running s.count() four times.
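To make the comparison concrete, here is a minimal sketch of the two approaches side by side (the sample string and variable names are just for illustration):

from collections import Counter

x = "GATTACAGATTACA"  # any DNA-like string

# One pass over the string, but every character is hashed and a dict
# entry is updated through Python's general-purpose machinery.
counts_counter = Counter(x)

# Four passes over the string, but each pass is a tight C loop over
# the underlying character buffer.
counts_str = {base: x.count(base) for base in "ACGT"}

print(counts_counter)
print(counts_str)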
Counter() can use a C helper to count elements, but it seems it doesn't special-case strings and instead uses the general algorithm that works for any iterable, i.e., processing a single character involves multiple Python C API calls: advance the iterator, get the previous value (a lookup in the hash table), increment the counter, and set the new value (another lookup in the hash table):
while (1) {
    key = PyIter_Next(it);
    if (key == NULL)
        break;
    oldval = PyObject_GetItem(mapping, key);
    if (oldval == NULL) {
        if (!PyErr_Occurred() || !PyErr_ExceptionMatches(PyExc_KeyError))
            break;
        PyErr_Clear();
        Py_INCREF(one);
        newval = one;
    } else {
        newval = PyNumber_Add(oldval, one);
        Py_DECREF(oldval);
        if (newval == NULL)
            break;
    }
    if (PyObject_SetItem(mapping, key, newval) == -1)
        break;
    Py_CLEAR(newval);
    Py_DECREF(key);
}
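In rough pure-Python terms, that loop does something like the following for every single element; this is only a sketch of the general iterator path, not the actual implementation:

def count_elements(mapping, iterable):
    # Per element: advance the iterator, one hash lookup to read the old
    # count, one addition, one hash lookup to store the new count.
    for key in iterable:
        mapping[key] = mapping.get(key, 0) + 1
    return mapping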
Compare it to the FASTSEARCH() overhead for bytestrings:
for (i = 0; i < n; i++)
    if (s[i] == p[0]) {
        count++;
        if (count == maxcount)
            return maxcount;
    }
return count;
The Counter class inherits from dict, while string.count uses the following C implementation (CPython 3.3):
/* stringlib: count implementation */

#ifndef STRINGLIB_FASTSEARCH_H
#error must include "stringlib/fastsearch.h" before including this module
#endif

Py_LOCAL_INLINE(Py_ssize_t)
STRINGLIB(count)(const STRINGLIB_CHAR* str, Py_ssize_t str_len,
                 const STRINGLIB_CHAR* sub, Py_ssize_t sub_len,
                 Py_ssize_t maxcount)
{
    Py_ssize_t count;

    if (str_len < 0)
        return 0; /* start > len(str) */
    if (sub_len == 0)
        return (str_len < maxcount) ? str_len + 1 : maxcount;

    count = FASTSEARCH(str, str_len, sub, sub_len, maxcount, FAST_COUNT);

    if (count < 0)
        return 0; /* no match */
    return count;
}
Guess which one is faster? :)