Why is Collections.counter so slow?

Tags:

I'm trying to solve a Rosalind basic problem of counting nucleotides in a given sequence, and returning the results in a list. For those ones not familiar with bioinformatics it's just counting the number of occurrences of 4 different characters ('A','C','G','T') inside a string.

I expected collections.Counter to be the fastest method (first because they claim to be high-performance, and second because I saw a lot of people using it for this specific problem).

But to my surprise this method is the slowest!

I compared three different methods, using timeit and running two types of experiments:

Running a long sequence few times
Running a short sequence a lot of times.

Here is my code:

Click to copy

import timeit
from collections import Counter

# Method1: using count
def method1(seq):
    return [seq.count('A'), seq.count('C'), seq.count('G'), seq.count('T')]

# method 2: using a loop
def method2(seq):
    r = [0, 0, 0, 0]
    for i in seq:
        if i == 'A':
            r[0] += 1
        elif i == 'C':
            r[1] += 1
        elif i == 'G':
            r[2] += 1
        else:
            r[3] += 1
    return r

# method 3: using Collections.counter
def method3(seq):
    counter = Counter(seq)
    return [counter['A'], counter['C'], counter['G'], counter['T']]


if __name__ == '__main__':

    # Long dummy sequence
    long_seq = 'ACAGCATGCA' * 10000000
    # Short dummy sequence
    short_seq = 'ACAGCATGCA' * 1000

    # Test 1: Running a long sequence once
    print timeit.timeit("method1(long_seq)", setup='from __main__ import method1, long_seq', number=1)
    print timeit.timeit("method2(long_seq)", setup='from __main__ import method2, long_seq', number=1)
    print timeit.timeit("method3(long_seq)", setup='from __main__ import method3, long_seq', number=1)

    # Test2: Running a short sequence lots of times
    print timeit.timeit("method1(short_seq)", setup='from __main__ import method1, short_seq', number=10000)
    print timeit.timeit("method2(short_seq)", setup='from __main__ import method2, short_seq', number=10000)
    print timeit.timeit("method3(short_seq)", setup='from __main__ import method3, short_seq', number=10000)

Results:

Click to copy

Test1: 
Method1: 0.224009990692
Method2: 13.7929501534
Method3: 18.9483819008

Test2:
Method1: 0.224207878113
Method2: 13.8520510197
Method3: 18.9861831665

Method 1 is way faster than method 2 and 3 for both experiments!!

So I have a set of questions:

Am I doing something wrong or it is indeed slower than the other two approaches? Could someone run the same code and share the results?
In case my results are correct, (and maybe this should be another question) is there a faster method to solve this problem than using method 1?
If count is faster, then what's the deal with collections.Counter?

401

asked Jan 11 '17 15:01

Sembei Norimaki

3 Answers

It's not because collections.Counter is slow, it's actually quite fast, but it's a general purpose tool, counting characters is just one of many applications.

On the other hand str.count just counts characters in strings and it's heavily optimized for its one and only task.

That means that str.count can work on the underlying C-char array while it can avoid creating new (or looking up existing) length-1-python-strings during the iteration (which is what for and Counter do).

Just to add some more context to this statement.

A string is stored as C array wrapped as python object. The str.count knows that the string is a contiguous array and thus converts the character you want to co to a C-"character", then iterates over the array in native C code and checks for equality and finally wraps and returns the number of found occurrences.

On the other hand for and Counter use the python-iteration-protocol. Each character of your string will be wrapped as python-object and then it (hashes and) compares them within python.

So the slowdown is because:

Each character has to be converted to a Python object (this is the major reason for the performance loss)
The loop is done in Python (not applicable to Counter in python 3.x because it was rewritten in C)
Each comparison has to be done in Python (instead of just comparing numbers in C - characters are represented by numbers)
The counter needs to hash the values and your loop needs to index your list.

Note the reason for the slowdown is similar to the question about Why are Python's arrays slow?.

I did some additional benchmarks to find out at which point collections.Counter is to be preferred over str.count. To this end I created random strings containing differing numbers of unique characters and plotted the performance:

Click to copy

from collections import Counter
import random
import string

characters = string.printable  # 100 different printable characters

results_counter = []
results_count = []
nchars = []

for i in range(1, 110, 10):
    chars = characters[:i]
    string = ''.join(random.choice(chars) for _ in range(10000))
    res1 = %timeit -o Counter(string)
    res2 = %timeit -o {char: string.count(char) for char in chars}
    nchars.append(len(chars))
    results_counter.append(res1)
    results_count.append(res2)

and the result was plotted using matplotlib:

Click to copy

import matplotlib.pyplot as plt

plt.figure()

plt.plot(nchars, [i.best * 1000 for i in results_counter], label="Counter",   c='black')
plt.plot(nchars, [i.best * 1000 for i in results_count],   label="str.count", c='red')
plt.xlabel('number of different characters')
plt.ylabel('time to count the chars in a string of length 10000 [ms]')
plt.legend()

Results for Python 3.5

The results for Python 3.6 are very similar so I didn't list them explicitly.

enter image description here

So if you want to count 80 different characters Counter becomes faster/comparable because it traverses the string only once and not multiple times like str.count. This will be weakly dependent on the length of the string (but testing showed only a very weak difference +/-2%).

Results for Python 2.7

enter image description here

In Python-2.7 collections.Counter was implemented using python (instead of C) and is much slower. The break-even point for str.count and Counter can only be estimated by extrapolation because even with 100 different characters the str.count is still 6 times faster.

136

answered Oct 01 '22 18:10

MSeifert

The time difference here is pretty simple to explain. It all comes down to what runs within Python and what runs as native code. The latter will always be faster since it does not come with lots of evaluation overhead.

Now that’s already the reason why calling str.count() four times is faster than anything else. Although this iterates the string four times, these loops run in native code. str.count is implemented in C, so this has very little overhead, making this very fast. It’s really difficult to beat this, especially when the task is that simple (looking only for simple character equality).

Your second method, of collecting the counts in an array is actually a less performant version of the following:

Click to copy

def method4 (seq):
    a, c, g, t = 0, 0, 0, 0
    for i in seq:
        if i == 'A':
            a += 1
        elif i == 'C':
            c += 1
        elif i == 'G':
            g += 1
        else:
            t += 1
    return [a, c, g, t]

Here, all four values are individual variables, so updating them is very fast. This is actually a bit faster than mutating list items.

The overall performance “problem” here is however that this iterates the string within Python. So this creates a string iterator and then produces every character individually as an actual string object. That’s a lot overhead and the main reason why every solution that works by iterating the string in Python will be slower.

The same problem is with collection.Counter. It’s implemented in Python so even though it’s very efficient and flexible, it suffers from the same issue that it’s just never near native in terms of speed.

answered Oct 01 '22 18:10

poke

As others have already noted, you are comparing fairly specific code against fairly general one.

Consider that something as trivial as spelling out the loop over the characters you are interested in is already buying you a factor 2, i.e.

Click to copy

def char_counter(text, chars='ACGT'):
    return [text.count(char) for char in chars]


%timeit method1(short_seq)
# 100000 loops, best of 3: 18.8 µs per loop
%timeit char_counter(short_seq)
# 10000 loops, best of 3: 40.8 µs per loop

%timeit method1(long_seq)
# 10 loops, best of 3: 172 ms per loop
%timeit char_counter(long_seq)
# 1 loop, best of 3: 374 ms per loop

Your method1() is the fastest but not the most efficient, as the input is looped through entirely for each char you are inspecting, thereby not taking advantage of the fact that you could easily short-circuit your looping as soon as a character gets assigned to one of the character classes.

Unfortunately, Python does not offer a fast method to take advantage of the specific conditions of your problem. However, you could use Cython for this, and you would then be able to outperform your method1():

Click to copy

%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True

import numpy as np


cdef void _count_acgt(
        const unsigned char[::1] text,
        unsigned long len_text,
        unsigned long[::1] counts):
    for i in range(len_text):
        if text[i] == b'A':
            counts[0] += 1
        elif text[i] == b'C':
            counts[1] += 1
        elif text[i] == b'G':
            counts[2] += 1
        else:
            counts[3] += 1


cpdef ascii_count_acgt(text):
    counts = np.zeros(4, dtype=np.uint64)
    bin_text = text.encode()
    return _count_acgt(bin_text, len(bin_text), counts)

Click to copy

%timeit ascii_count_acgt(short_seq)
# 100000 loops, best of 3: 12.6 µs per loop
%timeit ascii_count_acgt(long_seq)
# 10 loops, best of 3: 140 ms per loop

answered Oct 01 '22 19:10

norok2

Related questions
                            
                                Accessing key in factory of defaultdict
                            
                                Suppress code in NBConvert? IPython
                            
                                Python Add string to each line in a file
                            
                                Django Celery send register email do not work
                            
                                Django-rest-framework permissions for create in viewset
                            
                                How to make a post with a from data of empty json through HTTPie?
                            
                                Celery task always PENDING
                            
                                Draggable line with draggable points
                            
                                Equal Error Rate in Python
                            
                                How to list all unused jenkins plugins?
                            
                                Python, how to enable all warnings?
                            
                                Can't open video using opencv
                            
                                Django: show the count of related objects in admin list_display
                            
                                OSError: dlopen(libSystem.dylib, 6): image not found
                            
                                How to get boxplot data for matplotlib boxplots
                            
                                Does GridSearchCV store all the scores for all parameter combinations?
                            
                                Django and 'virtualenv' - proper project structure
                            
                                Subprocess timeout failure
                            
                                Add a new sheet to a existing workbook in python
                            
                                How to generate a unique auth token in python?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why is Collections.counter so slow?

Tags:

performance

python

collections

counter

bioinformatics

Sembei Norimaki

People also ask

3 Answers

Results for Python 3.5

Results for Python 2.7

MSeifert

poke

norok2

Recent Activity

Donate For Us