Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

itertools.imap vs map over the entire iterable

I'm curious about a statement from http://docs.python.org/2/library/itertools.html#itertools.imap, namely it describes

sum(imap(operator.mul, vector1, vector2))

as an efficient dot-product. My understanding is that imap gives a generator instead of a list, and while I understand how it would be faster/consume less memory if you're only considering the first few elements, with the surrounding sum(), I don't see how it behaves any differently than:

sum(map(operator.mul, vector1, vector2))
like image 477
Joe Avatar asked Dec 31 '13 16:12

Joe


4 Answers

The difference between map and imap becomes clear when you start increasing the size of what you're iterating over:

# xrange object, takes up no memory
data = xrange(1000000000)

# Tries to builds a list of 1 billion elements!
# Therefore, fails with MemoryError on 32-bit systems.
doubled = map(lambda x: x * 2, data)

# Generator object that lazily doubles each item as it's iterated over.
# Takes up very little (and constant, independent of data's size) memory.
iter_doubled = itertools.imap(lambda x: x * 2, data)

# This is where the iteration and the doubling happen.
# Again, since no list is created, this doesn't run you out of memory.
sum(iter_doubled)

# (The result is 999999999000000000L, if you're interested.
# It takes a minute or two to compute, but consumes minimal memory.)

Note that in Python 3, the built-in map behaves like Python 2's itertools.imap (which was removed because it's no longer needed). To get the "old map" behaviour, you'd use list(map(...)), which is another good way to visualize how Python 2's itertools.imap and map differ from each other.

like image 153
Max Noel Avatar answered Nov 06 '22 12:11

Max Noel


The first line will calculate the sum accumulating items one by one. The second one will at first calculate the whole dot-product, and after that, having the whole result in memory it would proceed to calculate the sum. So there is a memory complexity gain.

like image 22
BartoszKP Avatar answered Nov 06 '22 12:11

BartoszKP


One other thing to note is that "uses a lot less memory" often implies "runs faster" too. The lazy (iterator) version consumes each product as soon as it's computed, adding it into the running sum. The product and running sum are both almost certainly in L1 cache. If you compute all the products first, then depending on the number of elements it's certain that the first products computed will be kicked out of L1 cache, and then out of L2 cache, and ... so that when a second pass is made to finally add them together, all the products are low in the memory hierarchy (and in the extreme, have to be read up from a paging file).

But it's unclear to me what you mean by "don't see how it behaves any differently than". The final computed result is the same either way.

like image 39
Tim Peters Avatar answered Nov 06 '22 12:11

Tim Peters


The difference is that the entire output of imap(...) or map(...) is passed to sum(). You're write that imap returns a generator, but I think you might be under the impression that sum(map(...)) has some shortcut that does the same thing. It doesn't. map() will construct an entire list of results before anything gets passed to sum().

like image 23
Kirk Strauser Avatar answered Nov 06 '22 14:11

Kirk Strauser