Questions are summarized here. Yes, I know some of these answers ;) and I can do some hand-waving on others, but I'd really like to get to the nitty-gritty here.
I just watched PyCon 2011: How Dropbox Did It and How Python Helped (admittedly I skipped through most of the parts), but FINALLY the really interesting stuff started at around 22:23.
The speaker advocated writing your inner loops in C and said that "run once" stuff doesn't need much optimization (makes sense)... then he went on to state, paraphrased:
Pass a composition of iterators to any for massive speed improvements.
Here is the code (hopefully it's identical):
import itertools, hashlib, time

_md5 = hashlib.md5()

def run():
    for i in itertools.repeat("foo", 10000000):
        _md5.update(i)

a = time.time(); run(); time.time() - a
Out[118]: 9.44077205657959
_md5 = hashlib.md5()

def run():
    any(itertools.imap(_md5.update, itertools.repeat("foo", 10000000)))

a = time.time(); run(); time.time() - a
Out[121]: 6.547091007232666
Hmm, looks like for even greater speed improvements I can just get a faster computer! (Judging from his slide.)
He then does a bunch of hand-waving without actually going into details as to why.
I already knew about the iterators thanks to Alex Martelli's answer to "pythonic way to do something N times without an index variable?".
Then I thought: I wonder if the imap actually adds the speed improvement? My final thought was WTF??? Passing to any? REALLY??? Surely that can't be right, since the documentation defines any as:
def any(iterable):
    for element in iterable:
        if element:
            return True
    return False
Why in the world would passing iterators to any make my code faster?
I then tested it using the following (among many other tests), but this is what gets me:
def print_true(x):
    print 'True'
    return 'Awesome'

def test_for_loop_over_iter_map_over_iter_repeat():
    for result in itertools.imap(print_true, itertools.repeat("foo", 5)):
        pass

def run_any_over_iter_map_over_iter_repeat():
    any(itertools.imap(print_true, itertools.repeat("foo", 5)))
And the runs:
In [67]: test_for_loop_over_iter_map_over_iter_repeat()
True
True
True
True
True
In [74]: run_any_over_iter_map_over_iter_repeat()
True
For shame. I declared: this GUY IS FULL OF IT. Heretic! But I calmed down and continued to test. If this were true, how in the hell could Dropbox even work?!
And with further testing it did work... I initially just used a simple counter object, and it counted all the way up to 10000000 in both cases.
So the question is why did my Counter object work and my print_true function fail miserably?
class Counter(object):
    count = 0
    def count_one(self, none):
        self.count += 1

def run_any_counter():
    counter = Counter()
    any(itertools.imap(counter.count_one, itertools.repeat("foo", 10000000)))
    print counter.count

def run_for_counter():
    counter = Counter()
    for result in itertools.imap(counter.count_one, itertools.repeat("foo", 10000000)):
        pass
    print counter.count
Output:
%time run_for_counter()
10000000
CPU times: user 5.54 s, sys: 0.03 s, total: 5.57 s
Wall time: 5.68 s
%time run_any_counter()
10000000
CPU times: user 5.28 s, sys: 0.02 s, total: 5.30 s
Wall time: 5.40 s
An even bigger WTF: even after removing the unneeded argument and writing the most sensible code for my Counter object, it's STILL slower than the any-map version. Where is my carrot?!
class CounterNoArg(object):
    count = 0
    def count_one(self):
        self.count += 1

def straight_count():
    counter = CounterNoArg()
    for _ in itertools.repeat(None, 10000000):
        counter.count_one()
    print counter.count
Output:
In [111]: %time straight_count()
10000000
CPU times: user 5.44 s, sys: 0.02 s, total: 5.46 s
Wall time: 5.60 s
I'm asking because I think the Pythonistas or Pythoneers need a carrot so we don't start passing stuff to any or all for a performance increase, or does one already exist? Possibly an equivalent to itertools.imap that will just call a function over and over again, and optionally a certain number of times.
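For reference, the standard library can get close to this already: itertools.starmap(f, itertools.repeat((), n)) calls f() exactly n times, with the argument handling done in C. A minimal sketch, using the CounterNoArg class from above:

import itertools

def starmap_count():
    counter = CounterNoArg()
    # starmap unpacks each empty tuple as an empty argument list,
    # so counter.count_one() is called with no arguments each time.
    for _ in itertools.starmap(counter.count_one, itertools.repeat((), 10000000)):
        pass
    print counter.count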
The best I've managed are the following (using a list comprehension gives interesting results):
def super_run():
    counter = CounterNoArg()
    for _ in (call() for call in itertools.repeat(counter.count_one, 10000000)):
        pass
    print counter.count

def super_counter_run():
    counter = CounterNoArg()
    [call() for call in itertools.repeat(counter.count_one, 10000000)]
    print counter.count

def run_any_counter():
    counter = Counter()
    any(itertools.imap(counter.count_one, itertools.repeat("foo", 10000000)))
    print counter.count
%time super_run()
10000000
CPU times: user 5.23 s, sys: 0.03 s, total: 5.26 s
Wall time: 5.43 s
%time super_counter_run()
10000000
CPU times: user 4.75 s, sys: 0.18 s, total: 4.94 s
Wall time: 5.80 s
%time run_any_counter()
10000000
CPU times: user 5.15 s, sys: 0.06 s, total: 5.21 s
Wall time: 5.30 s
def run_any_like_presentation():
    any(itertools.imap(_md5.update, itertools.repeat("foo", 10000000)))

def super_run_like_presentation():
    [do_work for do_work in itertools.imap(_md5.update, itertools.repeat("foo", 10000000))]

def super_run_like_presentation_2():
    [_md5.update(foo) for foo in itertools.repeat("foo", 10000000)]
%time run_any_like_presentation()
CPU times: user 5.28 s, sys: 0.02 s, total: 5.29 s
Wall time: 5.47 s
%time super_run_like_presentation()
CPU times: user 6.14 s, sys: 0.18 s, total: 6.33 s
Wall time: 7.56 s
%time super_run_like_presentation_2()
CPU times: user 8.44 s, sys: 0.22 s, total: 8.66 s
Wall time: 9.59 s
Ugh...
Note: I encourage you to run the tests yourself.
In your first example, the first version of run has to look up _md5.update each time round the loop, whereas the second version does not. I think you'll find that accounts for most of the performance difference. The rest is probably to do with having to set the local variable i, though that's not so easy to demonstrate.
import itertools, hashlib, timeit

_md5 = hashlib.md5()

def run1():
    for i in itertools.repeat("foo", 10000000):
        _md5.update(i)

def run2():
    u = _md5.update
    for i in itertools.repeat("foo", 10000000):
        u(i)

def run3():
    any(itertools.imap(_md5.update, itertools.repeat("foo", 10000000)))
>>> timeit.timeit('run1()', 'from __main__ import run1', number=1)
6.081272840499878
>>> timeit.timeit('run2()', 'from __main__ import run2', number=1)
4.660238981246948
>>> timeit.timeit('run3()', 'from __main__ import run3', number=1)
4.062871932983398
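You can see the lookup difference for yourself by disassembling the two loops with the dis module (the opcode names below are from CPython 2.7):

import dis

dis.dis(run1)
# run1's loop body performs, on every iteration, roughly:
#     LOAD_GLOBAL   _md5      (dictionary lookup in the global namespace)
#     LOAD_ATTR     update    (attribute lookup on the md5 object)
#     LOAD_FAST     i
#     CALL_FUNCTION 1

dis.dis(run2)
# run2's loop body replaces the global and attribute lookups
# with a single fast array-indexed local lookup:
#     LOAD_FAST     u
#     LOAD_FAST     i
#     CALL_FUNCTION 1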
The itertools documentation has a better recipe for consuming an iterator (and discarding all its values): see the consume function. The use of any to do this job depends on the fact that _md5.update always returns None, so this approach doesn't work in general. Also, the consume recipe is very slightly faster:
import collections

def consume(it):
    "Consume iterator completely (discarding its values)."
    collections.deque(it, maxlen=0)

def run4():
    consume(itertools.imap(_md5.update, itertools.repeat("foo", 10000000)))

>>> timeit.timeit('run4()', 'from __main__ import run4', number=1)
3.969902992248535
Edited to add: it seems that the consume recipe is not as well known as it should be. If you look at the details of the CPython implementation, you'll see that when collections.deque is called with maxlen=0, it calls the function consume_iterator in _collectionsmodule.c, which looks like this:
static PyObject*
consume_iterator(PyObject *it)
{
    PyObject *item;
    while ((item = PyIter_Next(it)) != NULL) {
        Py_DECREF(item);
    }
    Py_DECREF(it);
    if (PyErr_Occurred())
        return NULL;
    Py_RETURN_NONE;
}
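In pure-Python terms, consume_iterator is roughly just the loop below; the difference is that the C version runs the whole loop without executing any Python bytecode per item:

def consume_python_equivalent(it):
    # Fetch each item and immediately discard it, as the C loop
    # above does -- but here each step pays bytecode-dispatch cost.
    for item in it:
        pass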
To answer the first question about optimizing by passing to any: no, I believe it is not a good idea, for the main reason that this is not its intended purpose. Sure, it's easy to implement, but maintenance could become a nightmare, because doing this introduces a new gotcha into your code base: if the mapped function ever returns a truthy value, any will short-circuit, the iterator will not be fully consumed, and you'll get strange behavior and bugs that are hard to track down. Also, there exist faster (or at least nearly as fast) alternatives to the built-in any.
Of course, you can make an exception, because it seems any can actually outperform deque, but using any this way is certainly extreme and most often unnecessary. In fact, if anything, you may be introducing "optimizations" that are no longer optimal after updates to the Python code base (compare 2.7 vs 3.2 below).
Another thing worth mentioning is that this use of any doesn't immediately make sense to a reader. Whether to implement a C extension before resorting to any like this is also debatable; personally, I'd prefer the C extension for semantic reasons.
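To make the gotcha concrete: any stops at the first truthy result, so the moment the mapped function returns something truthy, the rest of the work is silently skipped:

import itertools

processed = []

def record(x):
    processed.append(x)
    return x == 3   # truthy for the third item only

# any() short-circuits on the first True, so items 4 and 5 are
# never processed -- exactly the hard-to-track-down bug described above.
any(itertools.imap(record, [1, 2, 3, 4, 5]))
print processed   # [1, 2, 3]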
As far as optimizing your own code goes, let's start with what we're up against: refer to run_any_like_presentation. It's pretty fast :)
An initial implementation could look something like:
import itertools, hashlib

_md5 = hashlib.md5()

def run():
    for _ in xrange(100000000):
        _md5.update("foo")
The first step is using itertools.repeat to do something N times.
def run_just_repeat():
    for foo in itertools.repeat("foo", 100000000):
        _md5.update(foo)
The second optimization is to use itertools.imap to avoid passing the foo reference around in Python code; that now happens in C.
def run_imap_and_repeat():
    for do_work in itertools.imap(_md5.update, itertools.repeat("foo", 100000000)):
        pass
The third optimization is to move the for loop entirely into C code.
import collections

def run_deque_imap_and_repeat():
    collections.deque(itertools.imap(_md5.update, itertools.repeat("foo", 100000000)),
                      maxlen=0)
The final optimization is to move all potential lookups into the namespace of the run function:
This idea is taken from the very end of http://docs.python.org/library/itertools.html?highlight=itertools
Note, many of the above recipes can be optimized by replacing global lookups with local variables defined as default values.
Personally, I had mixed success getting this to show improvements: small gains under certain conditions, with "from module import xxx" also showing performance increases without passing anything in as a default argument. Sometimes passing in some variables but not others makes a slight difference as well. The point is, I feel this is one you're going to need to test yourself to see whether it works for you.
def run_deque_imap_and_repeat_all_local(deque=collections.deque,
        imap=itertools.imap, _md5=_md5, repeat=itertools.repeat,
        md5=hashlib.md5):
    update = _md5.update
    deque(imap(update, repeat("foo", 100000000)), maxlen=0)
And finally, to be fair, let's implement an any version, like the presentation's, that does the final optimization as well.
def run_any_like_presentation_all_local(any=any, deque=collections.deque,
        imap=itertools.imap, _md5=_md5, repeat=itertools.repeat,
        md5=hashlib.md5):
    any(imap(_md5.update, repeat("foo", 100000000)))
OK, now let's run some tests (Python 2.7.2, OS X Snow Leopard, 64-bit):
run_reference - 123.913 seconds
run_deque_imap_and_repeat_all_local - 51.201 seconds
run_deque_local_imap_and_repeat - 53.013 seconds
run_deque_imap_and_repeat - 48.913 seconds
run_any_like_presentation - 49.833 seconds
run_any_like_presentation_all_local - 47.780 seconds
And just for kicks, in Python 3 (Python 3.2, OS X Snow Leopard, 64-bit):
run_reference - 94.273 seconds (100000004 function calls!)
run_deque_imap_and_repeat_all_local - 23.929 seconds
run_deque_local_imap_and_repeat - 23.298 seconds
run_deque_imap_and_repeat - 24.201 seconds
run_any_like_presentation - 24.026 seconds
run_any_like_presentation_all_local - 25.316 seconds
Here's my source for the tests:
import itertools, hashlib, collections

_md5 = hashlib.md5()

def run_reference():
    for _ in xrange(100000000):
        _md5.update("foo")

def run_deque_imap_and_repeat_all_local(deque=collections.deque,
        imap=itertools.imap, _md5=_md5, repeat=itertools.repeat,
        md5=hashlib.md5):
    deque(imap(_md5.update, repeat("foo", 100000000)), maxlen=0)

def run_deque_local_imap_and_repeat(deque=collections.deque,
        imap=itertools.imap, _md5=_md5, repeat=itertools.repeat,
        md5=hashlib.md5):
    deque(imap(_md5.update, repeat("foo", 100000000)), maxlen=0)

def run_deque_imap_and_repeat():
    collections.deque(itertools.imap(_md5.update, itertools.repeat("foo", 100000000)),
                      maxlen=0)

def run_any_like_presentation():
    any(itertools.imap(_md5.update, itertools.repeat("foo", 100000000)))

def run_any_like_presentation_all_local(any=any, deque=collections.deque,
        imap=itertools.imap, _md5=_md5, repeat=itertools.repeat,
        md5=hashlib.md5):
    any(imap(_md5.update, repeat("foo", 100000000)))

import cProfile
import pstats

def performance_test(a_func):
    cProfile.run(a_func, 'stats')
    p = pstats.Stats('stats')
    p.sort_stats('time').print_stats(10)

performance_test('run_reference()')
performance_test('run_deque_imap_and_repeat_all_local()')
performance_test('run_deque_local_imap_and_repeat()')
performance_test('run_deque_imap_and_repeat()')
performance_test('run_any_like_presentation()')
performance_test('run_any_like_presentation_all_local()')
And the Python 3 source:
import itertools, hashlib, collections

_md5 = hashlib.md5()

def run_reference(foo="foo".encode('utf-8')):
    for _ in range(100000000):
        _md5.update(foo)

def run_deque_imap_and_repeat_all_local(deque=collections.deque,
        imap=map, _md5=_md5, repeat=itertools.repeat,
        md5=hashlib.md5):
    deque(imap(_md5.update, repeat("foo".encode('utf-8'), 100000000)), maxlen=0)

def run_deque_local_imap_and_repeat(deque=collections.deque,
        imap=map, _md5=_md5, repeat=itertools.repeat,
        md5=hashlib.md5):
    deque(imap(_md5.update, repeat("foo".encode('utf-8'), 100000000)), maxlen=0)

def run_deque_imap_and_repeat():
    collections.deque(map(_md5.update, itertools.repeat("foo".encode('utf-8'), 100000000)),
                      maxlen=0)

def run_any_like_presentation():
    any(map(_md5.update, itertools.repeat("foo".encode('utf-8'), 100000000)))

def run_any_like_presentation_all_local(any=any, deque=collections.deque,
        imap=map, _md5=_md5, repeat=itertools.repeat):
    any(imap(_md5.update, repeat("foo".encode('utf-8'), 100000000)))

import cProfile
import pstats

def performance_test(a_func):
    cProfile.run(a_func, 'stats')
    p = pstats.Stats('stats')
    p.sort_stats('time').print_stats(10)

performance_test('run_reference()')
performance_test('run_deque_imap_and_repeat_all_local()')
performance_test('run_deque_local_imap_and_repeat()')
performance_test('run_deque_imap_and_repeat()')
performance_test('run_any_like_presentation()')
performance_test('run_any_like_presentation_all_local()')
One more thing: don't do this on a real project unless there is a certifiable performance bottleneck.
And finally, if we really do need a carrot (aside from writing code that makes sense and isn't prone to error) in those hard times when any actually outperforms deque: your more sensible code will be in a better position to take advantage of improvements in newer versions of Python without your having to modify the code base.
http://www.python.org/doc/essays/list2str/ is a good read on how to approach Python optimization (i.e., ideally, writing a C extension is NOT the first thing you reach for).
I'd also like to point out Gareth's answer, as he may be on to why any can outperform deque.