I'm curious about a statement from http://docs.python.org/2/library/itertools.html#itertools.imap, namely it describes
sum(imap(operator.mul, vector1, vector2))
as an efficient dot-product. My understanding is that imap gives a generator instead of a list, and while I understand how it would be faster/consume less memory if you're only considering the first few elements, with the surrounding sum(), I don't see how it behaves any differently than:
sum(map(operator.mul, vector1, vector2))
The difference between map
and imap
becomes clear when you start increasing the size of what you're iterating over:
# xrange object, takes up no memory
data = xrange(1000000000)
# Tries to builds a list of 1 billion elements!
# Therefore, fails with MemoryError on 32-bit systems.
doubled = map(lambda x: x * 2, data)
# Generator object that lazily doubles each item as it's iterated over.
# Takes up very little (and constant, independent of data's size) memory.
iter_doubled = itertools.imap(lambda x: x * 2, data)
# This is where the iteration and the doubling happen.
# Again, since no list is created, this doesn't run you out of memory.
sum(iter_doubled)
# (The result is 999999999000000000L, if you're interested.
# It takes a minute or two to compute, but consumes minimal memory.)
Note that in Python 3, the built-in map
behaves like Python 2's itertools.imap
(which was removed because it's no longer needed). To get the "old map
" behaviour, you'd use list(map(...))
, which is another good way to visualize how Python 2's itertools.imap
and map
differ from each other.
The first line will calculate the sum accumulating items one by one. The second one will at first calculate the whole dot-product, and after that, having the whole result in memory it would proceed to calculate the sum. So there is a memory complexity gain.
One other thing to note is that "uses a lot less memory" often implies "runs faster" too. The lazy (iterator) version consumes each product as soon as it's computed, adding it into the running sum. The product and running sum are both almost certainly in L1 cache. If you compute all the products first, then depending on the number of elements it's certain that the first products computed will be kicked out of L1 cache, and then out of L2 cache, and ... so that when a second pass is made to finally add them together, all the products are low in the memory hierarchy (and in the extreme, have to be read up from a paging file).
But it's unclear to me what you mean by "don't see how it behaves any differently than". The final computed result is the same either way.
The difference is that the entire output of imap(...)
or map(...)
is passed to sum()
. You're write that imap
returns a generator, but I think you might be under the impression that sum(map(...))
has some shortcut that does the same thing. It doesn't. map()
will construct an entire list of results before anything gets passed to sum()
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With