
Passing iterators to any for execution for speed and Why?

Questions are summarized here. Yes, I know some of these answers ;) and I can do some hand-waving on others, but I'd really like to get to the nitty-gritty here.

  1. Is this even a good idea at all? (This one is not below)
  2. I wonder if the map actually adds speed improvements? Why?
  3. Why in the world would passing iterators to any make my code faster?
  4. Why did my Counter object work and my print_true function fail miserably?
  5. Is there an equivalent to itertools.imap that will just call a function over and over again, and optionally a certain number of times?
  6. Where is my carrot?!?

I just watched PyCon 2011: How Dropbox Did It and How Python Helped (admittedly I skipped through most of the parts), but FINALLY the really interesting stuff started at around 22:23.

The speaker advocated making your inner loops in C and noted that "run once" stuff doesn't need much optimization (makes sense)... then he goes on to state, paraphrased:

Pass a composition of iterators to any for massive speed improvements.

Here is the code (hopefully it's identical):

import itertools, hashlib, time   
_md5 = hashlib.md5()  
def run():
    for i in itertools.repeat("foo", 10000000):
        _md5.update(i)
a = time.time();  run(); time.time() - a  
Out[118]: 9.44077205657959

_md5 = hashlib.md5() 
def run():
    any(itertools.imap(_md5.update, itertools.repeat("foo", 10000000)))    
a = time.time();  run(); time.time() - a
Out[121]: 6.547091007232666

Hmm, looks like for even greater speed improvements I can just get a faster computer! (Judging from his slide.)

He then does a bunch of hand-waving without actually going into details as to why.

I already knew about the iterators from the answer to pythonic way to do something N times without an index variable? thanks to Alex Martelli.

Then I thought, I wonder if the map actually adds speed improvements? My final thought was WTF??? Passing to any? REALLY??? Surely that can't be right, since the documentation defines any as equivalent to:

def any(iterable):
    for element in iterable:
        if element:
            return True
    return False

Why in the world would passing iterators to any make my code faster?

I then tested it using the following (among many other tests) but this is what gets me:

def print_true(x):
    print 'True'
    return 'Awesome'

def test_for_loop_over_iter_map_over_iter_repeat():
    for result in itertools.imap(print_true, itertools.repeat("foo", 5)):
        pass

def run_any_over_iter_map_over_iter_repeat():
    any(itertools.imap(print_true, itertools.repeat("foo", 5)))

And the runs:

    In [67]: test_for_loop_over_iter_map_over_iter_repeat()
    True
    True
    True
    True
    True

    In [74]: run_any_over_iter_map_over_iter_repeat()
    True

For shame. I declared this GUY IS FULL OF IT. Heretic! But, I calmed down and continued to test. If this were true how in the hell could Dropbox even work!?!?

And with further testing it did work... I initially just used a simple counter object, and it counted all the way up to 10000000 in both cases.

So the question is why did my Counter object work and my print_true function fail miserably?

class Counter(object):
    count = 0
    def count_one(self, none):
        self.count += 1

def run_any_counter():
    counter = Counter()
    any(itertools.imap(counter.count_one, itertools.repeat("foo", 10000000)))
    print counter.count

def run_for_counter():
    counter = Counter()
    for result in itertools.imap(counter.count_one, itertools.repeat("foo", 10000000)):
        pass
    print counter.count

Output:

%time run_for_counter()
10000000
CPU times: user 5.54 s, sys: 0.03 s, total: 5.57 s
Wall time: 5.68 s

%time run_any_counter()
10000000
CPU times: user 5.28 s, sys: 0.02 s, total: 5.30 s
Wall time: 5.40 s
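A quick check of the return values of the two callables restates the observation at a smaller scale (a sketch in Python 3 syntax, with the lazy built-in map standing in for itertools.imap):

```python
# print_true returns a truthy string, so any() can short-circuit.
calls = []

def print_true(x):
    calls.append(x)
    return 'Awesome'        # truthy

any(map(print_true, ['foo'] * 5))
print(len(calls))           # 1 -- any() stopped after the first truthy result

# count_one falls off the end, implicitly returning None (falsy),
# so any() never finds a truthy element and must consume everything.
class Counter(object):
    count = 0
    def count_one(self, _):
        self.count += 1

counter = Counter()
any(map(counter.count_one, ['foo'] * 5))
print(counter.count)        # 5 -- all five elements were consumed
```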

An even bigger WTF is that even after removing the unneeded argument and writing the most sensible code for my Counter object, it's STILL slower than the any-map version. Where is my carrot?!?:

class CounterNoArg(object):
    count = 0
    def count_one(self):
        self.count += 1

def straight_count():
    counter = CounterNoArg()
    for _ in itertools.repeat(None, 10000000):
        counter.count_one()
    print counter.count

Output:

In [111]: %time straight_count()
10000000
CPU times: user 5.44 s, sys: 0.02 s, total: 5.46 s
Wall time: 5.60 s

I'm asking because I think we Pythonistas or Pythoneers need a carrot so we don't start passing stuff to any or all for a performance increase, or does one already exist? Possibly an equivalent to itertools.imap that will just call a function over and over again, and optionally a certain number of times.
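For the record, the recipes section of the itertools docs includes a repeatfunc helper that is close to what I'm asking for; a sketch in Python 3 syntax (itertools.starmap does the repeated calling, so the loop stays in C):

```python
import collections
import itertools

def repeatfunc(func, times=None, *args):
    """Repeat calls to func with the given arguments, optionally a fixed
    number of times (adapted from the repeatfunc recipe in the itertools docs)."""
    if times is None:
        return itertools.starmap(func, itertools.repeat(args))
    return itertools.starmap(func, itertools.repeat(args, times))

class CounterNoArg(object):
    count = 0
    def count_one(self):
        self.count += 1

counter = CounterNoArg()
# Drain the lazy iterator of results, discarding each None.
collections.deque(repeatfunc(counter.count_one, 5), maxlen=0)
print(counter.count)  # 5
```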

The best I've managed are these (using a list comprehension gives interesting results):

def super_run():
    counter = CounterNoArg()
    for _ in (call() for call in itertools.repeat(counter.count_one, 10000000)):
        pass
    print counter.count

def super_counter_run():
    counter = CounterNoArg()
    [call() for call in itertools.repeat(counter.count_one, 10000000)]
    print counter.count

def run_any_counter():
    counter = Counter()
    any(itertools.imap(counter.count_one, itertools.repeat("foo", 10000000)))
    print counter.count

%time super_run()
10000000
CPU times: user 5.23 s, sys: 0.03 s, total: 5.26 s
Wall time: 5.43 s

%time super_counter_run()
10000000
CPU times: user 4.75 s, sys: 0.18 s, total: 4.94 s
Wall time: 5.80 s

%time run_any_counter()
10000000
CPU times: user 5.15 s, sys: 0.06 s, total: 5.21 s
Wall time: 5.30 s

def run_any_like_presentation():
    any(itertools.imap(_md5.update, itertools.repeat("foo", 10000000)))

def super_run_like_presentation():
    [do_work for do_work in itertools.imap(_md5.update, itertools.repeat("foo", 10000000))]

def super_run_like_presentation_2():
    [_md5.update(foo) for foo in itertools.repeat("foo", 10000000)]


%time run_any_like_presentation()
CPU times: user 5.28 s, sys: 0.02 s, total: 5.29 s
Wall time: 5.47 s

%time super_run_like_presentation()
CPU times: user 6.14 s, sys: 0.18 s, total: 6.33 s
Wall time: 7.56 s

%time super_run_like_presentation_2()
CPU times: user 8.44 s, sys: 0.22 s, total: 8.66 s
Wall time: 9.59 s

Ugh...

Note: I encourage you to run the tests yourself.
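If you want a self-contained harness to run yourself, something like this should reproduce the shape of the comparison (a sketch in Python 3 syntax; note that md5.update needs bytes there, and the iteration count is scaled down so it finishes quickly):

```python
import hashlib
import itertools
import timeit

N = 100000  # scaled down from the 10000000 used above

_md5 = hashlib.md5()

def run_for_loop():
    # Plain for loop: looks up _md5.update on every iteration.
    for chunk in itertools.repeat(b"foo", N):
        _md5.update(chunk)

def run_any_map():
    # The presentation's trick: relies on _md5.update always returning
    # None, so any() never short-circuits and drains the iterator.
    any(map(_md5.update, itertools.repeat(b"foo", N)))

for fn in (run_for_loop, run_any_map):
    print(fn.__name__, timeit.timeit(fn, number=10))
```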

Derek Litz asked Feb 04 '12

2 Answers

In your first example, the first version of run has to look up _md5.update each time round the loop, whereas the second version does not. I think you'll find that accounts for most of the performance difference. The rest is probably to do with having to set the local variable i, though that's not so easy to demonstrate.

import itertools, hashlib, timeit
_md5 = hashlib.md5()

def run1():
    for i in itertools.repeat("foo", 10000000):
        _md5.update(i)

def run2():
    u = _md5.update
    for i in itertools.repeat("foo", 10000000):
        u(i)

def run3():
    any(itertools.imap(_md5.update, itertools.repeat("foo", 10000000)))

>>> timeit.timeit('run1()', 'from __main__ import run1', number=1)
6.081272840499878
>>> timeit.timeit('run2()', 'from __main__ import run2', number=1)
4.660238981246948
>>> timeit.timeit('run3()', 'from __main__ import run3', number=1)
4.062871932983398

The itertools documentation has a better recipe for consuming an iterator (and discarding all its values): see the consume function. The use of any to do this job depends on the fact that _md5.update always returns None, so this approach doesn't work in general. Also, the recipe is very slightly faster:

import collections

def consume(it):
    "Consume iterator completely (discarding its values)."
    collections.deque(it, maxlen=0)

def run4():
    consume(itertools.imap(_md5.update, itertools.repeat("foo", 10000000)))

>>> timeit.timeit('run4()', 'from __main__ import run4', number=1)
3.969902992248535
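To see why consume is the more general tool, here is a small illustration (Python 3 syntax, with map playing the role of itertools.imap): if the mapped function ever returns something truthy, any stops at the first such value, while consume always drains the iterator.

```python
import collections

def consume(it):
    "Consume iterator completely (discarding its values)."
    collections.deque(it, maxlen=0)

seen = []

def record(x):
    seen.append(x)
    return x             # truthy for x = 1..5

any(map(record, range(1, 6)))      # short-circuits on the first truthy value
print(len(seen))                   # 1

seen.clear()
consume(map(record, range(1, 6)))  # drains everything regardless
print(len(seen))                   # 5
```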

Edited to add: it seems that the consume recipe is not as well known as it should be: if you look at the details of the CPython implementation, you'll see that when collections.deque is called with maxlen=0 then it calls the function consume_iterator in _collectionsmodule.c, which looks like this:

static PyObject*
consume_iterator(PyObject *it)
{
    PyObject *item;
    while ((item = PyIter_Next(it)) != NULL) {
        Py_DECREF(item);
    }
    Py_DECREF(it);
    if (PyErr_Occurred())
        return NULL;
    Py_RETURN_NONE;
}
Gareth Rees answered Oct 22 '22


To answer the first question about optimizing by passing to any: no, I believe it is not a good idea, for the main reason that this is not its intended purpose. Sure, it's easy to implement, but maintenance could become a nightmare. By doing this a new gotcha is introduced into your code base: if the function ever returns a truthy value, any will short-circuit and the iterator will not be fully consumed, causing strange behavior and bugs that are hard to track down. Also, there exist faster (or at least nearly as fast) alternatives to using the built-in any.

Of course, you can make an exception because it seems any can actually outperform deque, but using any is certainly extreme and most often unnecessary. In fact, if anything, you may be introducing optimizations that may no longer be "optimal" after updates to the Python code base (see 2.7 vs 3.2).

Another thing to mention is that this use of any doesn't immediately make sense to a reader. Whether to implement a C extension before resorting to any like this is also debatable; personally, I'd prefer the C extension for semantic reasons.

As far as optimizing your own code goes, let's start with what we're up against: refer to run_any_like_presentation. It's pretty fast :)

An initial implementation could look something like:

import itertools, hashlib
_md5 = hashlib.md5()
def run():
    for _ in xrange(100000000):
        _md5.update("foo")

The first step is using itertools.repeat to do something N times.

def run_just_repeat():
    for foo in itertools.repeat("foo", 100000000):
        _md5.update(foo)

The second optimization is to use itertools.imap so that the foo argument is passed to _md5.update in C code rather than in Python bytecode.

def run_imap_and_repeat():
    for do_work in itertools.imap(_md5.update, itertools.repeat("foo", 100000000)):
        pass

The third optimization is to move the for loop entirely into C code.

import collections
def run_deque_imap_and_repeat():
    collections.deque(itertools.imap(_md5.update, itertools.repeat("foo", 100000000)),
            maxlen = 0)

The final optimization is to move all potential look-ups into the local namespace of the run function:

This idea is taken from the very end of http://docs.python.org/library/itertools.html?highlight=itertools

Note, many of the above recipes can be optimized by replacing global lookups with local variables defined as default values.

Personally, I had mixed success with this showing improvements: small gains under certain conditions, with from module import xxx also showing performance increases without passing anything in as a default argument. Sometimes passing in some variables but not others made slight differences as well. The point is, I feel this is one you're going to need to test yourself to see if it works for you.

def run_deque_imap_and_repeat_all_local(deque = collections.deque, 
        imap = itertools.imap, _md5 = _md5, repeat = itertools.repeat, 
        md5 = hashlib.md5):
    update = _md5.update
    deque(imap(update, repeat("foo", 100000000)), maxlen = 0)

And finally, to be fair, let's implement an any version like the presentation's that does the final optimization as well.

def run_any_like_presentation_all_local(any = any, deque = collections.deque, 
        imap = itertools.imap, _md5 = _md5, repeat = itertools.repeat, 
        md5 = hashlib.md5):
    any(imap(_md5.update, repeat("foo", 100000000)))

Ok now let's run some tests (Python 2.7.2 OS X Snow Leopard 64-bit):

  • run_reference - 123.913 seconds

  • run_deque_imap_and_repeat_all_local - 51.201 seconds

  • run_deque_local_imap_and_repeat - 53.013 seconds

  • run_deque_imap_and_repeat - 48.913 seconds

  • run_any_like_presentation - 49.833 seconds

  • run_any_like_presentation_all_local - 47.780 seconds

And just for kicks in Python3 (Python 3.2 OS X Snow Leopard 64-bit):

  • run_reference - 94.273 seconds (100000004 function calls!)

  • run_deque_imap_and_repeat_all_local - 23.929 seconds

  • run_deque_local_imap_and_repeat - 23.298 seconds

  • run_deque_imap_and_repeat - 24.201 seconds

  • run_any_like_presentation - 24.026 seconds

  • run_any_like_presentation_all_local - 25.316 seconds

Here's my source for the tests:

import itertools, hashlib, collections
_md5 = hashlib.md5()

def run_reference():
    for _ in xrange(100000000):
        _md5.update("foo")

def run_deque_imap_and_repeat_all_local(deque = collections.deque,
        imap = itertools.imap, _md5 = _md5, repeat = itertools.repeat,
        md5 = hashlib.md5):
    deque(imap(_md5.update, repeat("foo", 100000000)), maxlen = 0)

def run_deque_local_imap_and_repeat(deque = collections.deque,
        imap = itertools.imap, _md5 = _md5, repeat = itertools.repeat,
        md5 = hashlib.md5):
    deque(imap(_md5.update, repeat("foo", 100000000)), maxlen = 0)

def run_deque_imap_and_repeat():
    collections.deque(itertools.imap(_md5.update, itertools.repeat("foo", 100000000)),
            maxlen = 0)

def run_any_like_presentation():
    any(itertools.imap(_md5.update, itertools.repeat("foo", 100000000)))

def run_any_like_presentation_all_local(any = any, deque = collections.deque,
        imap = itertools.imap, _md5 = _md5, repeat = itertools.repeat,
        md5 = hashlib.md5):
    any(imap(_md5.update, repeat("foo", 100000000)))

import cProfile
import pstats

def performance_test(a_func):
    cProfile.run(a_func, 'stats')
    p = pstats.Stats('stats')
    p.sort_stats('time').print_stats(10)

performance_test('run_reference()')
performance_test('run_deque_imap_and_repeat_all_local()')
performance_test('run_deque_local_imap_and_repeat()')
performance_test('run_deque_imap_and_repeat()')
performance_test('run_any_like_presentation()')
performance_test('run_any_like_presentation_all_local()')

And Python3

import itertools, hashlib, collections
_md5 = hashlib.md5()

def run_reference(foo = "foo".encode('utf-8')):
    for _ in range(100000000):
        _md5.update(foo)

def run_deque_imap_and_repeat_all_local(deque = collections.deque,
        imap = map, _md5 = _md5, repeat = itertools.repeat,
        md5 = hashlib.md5):
    deque(imap(_md5.update, repeat("foo".encode('utf-8'), 100000000)), maxlen = 0)

def run_deque_local_imap_and_repeat(deque = collections.deque,
        imap = map, _md5 = _md5, repeat = itertools.repeat,
        md5 = hashlib.md5):
    deque(imap(_md5.update, repeat("foo".encode('utf-8'), 100000000)), maxlen = 0)

def run_deque_imap_and_repeat():
    collections.deque(map(_md5.update, itertools.repeat("foo".encode('utf-8'), 100000000)),
            maxlen = 0)

def run_any_like_presentation():
    any(map(_md5.update, itertools.repeat("foo".encode('utf-8'), 100000000)))

def run_any_like_presentation_all_local(any = any, deque = collections.deque,
        imap = map, _md5 = _md5, repeat = itertools.repeat):
    any(imap(_md5.update, repeat("foo".encode('utf-8'), 100000000)))

import cProfile
import pstats

def performance_test(a_func):
    cProfile.run(a_func, 'stats')
    p = pstats.Stats('stats')
    p.sort_stats('time').print_stats(10)

performance_test('run_reference()')
performance_test('run_deque_imap_and_repeat_all_local()')
performance_test('run_deque_local_imap_and_repeat()')
performance_test('run_deque_imap_and_repeat()')
performance_test('run_any_like_presentation()')
performance_test('run_any_like_presentation_all_local()')

Another thing, don't do this on a real project unless there is a certifiable performance bottleneck.

And, finally, if we really need a carrot (aside from writing code that makes sense and isn't prone to error) in those hard times where any actually outperforms deque, your more sensible code will be in a better position to take advantage of improvements in newer versions of Python without having to modify your code base.

http://www.python.org/doc/essays/list2str/ is a good read on how to approach Python optimization (i.e. ideally writing a C extension is NOT the first thing you reach for).

I'd also like to point out Gareth's answer, as he may be on to why any can outperform deque.

Derek Litz answered Oct 22 '22