I'm new to Python. Does anyone know what the relationship is between Python's (and other functional languages') map()/reduce() functions and the MapReduce concept used in distributed computation?
The cloud concept of map/reduce is very similar, but changed to work in parallel. First, each data object is passed through a function that maps it to a new object (often some sort of dictionary). Then, a reduce function is called on pairs of the objects returned by map until only one object is left. That object is the result of the map/reduce operation.
One important consideration is that, because of the parallelization, the reduce function must be able to accept objects produced by the map function as well as objects produced by prior reduce calls. This makes more sense when you think about how the parallelization works: many machines each reduce their own data to a single object, and those objects are then reduced to a final output. Of course, this may happen in more than one layer if there is a lot of data.
Here's a simple example of how you might use the map/reduce framework to count words in a list (note that the name list shadows the built-in type and is a poor choice in real code, but it is kept here for brevity):
list = ['a', 'foo', 'bar', 'foobar', 'foo', 'a', 'bar', 'bar', 'bar', 'bar', 'foo']
list2 = ['b', 'foo', 'foo', 'b', 'a', 'bar']
The map function would look like this:
def wordToDict(word):
    return {word: 1}
And the reduce function would look like this:
def countReduce(d1, d2):
    out = d1.copy()
    for key in d2:
        if key in out:
            out[key] += d2[key]
        else:
            out[key] = d2[key]
    return out
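As an aside, the same merge can be written with the standard library's collections.Counter, whose + operator adds counts key by key (countReduce2 is just an illustrative name, not part of the example above):

```python
from collections import Counter

# Counter addition merges per-key counts, the same job countReduce does by hand.
def countReduce2(d1, d2):
    return Counter(d1) + Counter(d2)

merged = countReduce2({'a': 2, 'b': 1}, {'a': 1, 'c': 3})
print(merged)
```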
Then you can map/reduce like this (in Python 3 you need from functools import reduce first, since reduce is no longer a built-in):
reduce(countReduce, map(wordToDict, list + list2))
>>> {'a': 3, 'foobar': 1, 'b': 2, 'bar': 6, 'foo': 5}
But you can also do it like this (which is what parallelization would do):
reduce(countReduce, [reduce(countReduce, map(wordToDict, list)), reduce(countReduce, map(wordToDict, list2))])
>>> {'a': 3, 'foobar': 1, 'b': 2, 'foo': 5, 'bar': 6}
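To make the parallel layering concrete, here is a minimal sketch using the standard library's multiprocessing.Pool; the hand-split chunks stand in for data spread across machines, and the function names mirror the example above:

```python
from functools import reduce
from multiprocessing import Pool

def wordToDict(word):
    return {word: 1}

def countReduce(d1, d2):
    out = d1.copy()
    for key in d2:
        out[key] = out.get(key, 0) + d2[key]
    return out

def reduceChunk(chunk):
    # One "machine": map its chunk of words, then reduce to a single dict.
    return reduce(countReduce, map(wordToDict, chunk))

if __name__ == '__main__':
    chunks = [['a', 'foo', 'bar', 'foo'], ['b', 'a', 'bar']]
    with Pool(2) as pool:
        partials = pool.map(reduceChunk, chunks)  # first reduce layer, in parallel
    print(reduce(countReduce, partials))          # final layer combines the partial counts
```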
Actually, these concepts are somewhat different, and the common names are misleading.
In functional programming (where Python borrowed these functions):
map applies some function to all elements of a list and returns a new list.
reduce applies some function to aggregate all values in a list into a single value.
In distributed computation, MapReduce instead works on (key, value) pairs: the mapper takes each input pair and produces a list of output pairs, and the reducer takes a key together with the list of all values collected for that key and produces a list of output pairs.
Note that the mapper does not always produce one output pair per input pair, nor does the reducer always reduce every (key, list of values) to exactly one output pair. A mapper or reducer can output whatever it wants. For example, a mapper can be used to filter pairs: it produces an output pair for some input pairs and ignores the others. It is also not uncommon to yield more than one output pair per mapper/reducer input pair (or for only some of them).
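For instance, a hypothetical word-count mapper that both filters and emits multiple pairs per input might look like this (the tokenizing and the length filter are illustrative assumptions, not part of the answer above):

```python
def mapper(key, line):
    # A mapper is free to emit zero, one, or many output pairs per input:
    # short words are filtered out entirely, every other word yields a pair.
    for word in line.split():
        if len(word) >= 3:
            yield (word, 1)

pairs = list(mapper(1, 'a foo bar to foobar'))  # [('foo', 1), ('bar', 1), ('foobar', 1)]
```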
But in most cases MapReduce works in a similar or almost similar way to reduce(reduce_function, map(map_function, list)): the mapper usually does some computation for each input, and the reducer usually aggregates a list of values in some way. Any map_function and reduce_function can be expressed in MapReduce, but not vice versa.
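To make the comparison concrete, the reduce(reduce_function, map(map_function, list)) pattern looks like this in plain Python (in Python 3, reduce must be imported from functools):

```python
from functools import reduce

nums = [1, 2, 3, 4]

# map_function: transform each element independently.
squares = map(lambda x: x * x, nums)

# reduce_function: fold the transformed values into one result.
total = reduce(lambda acc, x: acc + x, squares)
print(total)  # 1 + 4 + 9 + 16 = 30
```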