 

Relationship between python map reduce and cloud-computing map/reduce?

I'm new to Python.

Does anyone know what the relationship is between Python's (and other functional languages') map() / reduce() functions and the MapReduce concept used in distributed computing?

McAgee asked Nov 14 '11

2 Answers

The cloud concept of map/reduce is very similar, but changed to work in parallel. First, each data object is passed through a function that maps it to a new object (usually, some sort of dictionary). Then, a reduce function is called on pairs of the objects returned by map until there is only one left. That is the result of the map/reduce operation.

One important consideration is that, because of the parallelization, the reduce function must be able to take in objects from the map function as well as objects from prior reduce functions. This makes more sense when you think about how the parallelization goes. Many machines will each reduce their data to a single object, and those objects will then be reduced to a final output. Of course, this may happen in more than one layer if there is a lot of data.

Here's a simple example of how you might use the map/reduce framework to count words in a list:

words1 = ['a', 'foo', 'bar', 'foobar', 'foo', 'a', 'bar', 'bar', 'bar', 'bar', 'foo']
words2 = ['b', 'foo', 'foo', 'b', 'a', 'bar']

The map function would look like this:

def wordToDict(word):
  return {word: 1}

And the reduce function would look like this:

def countReduce(d1, d2):
  out = d1.copy()
  for key in d2:
    if key in out:
      out[key] += d2[key]
    else:
      out[key] = d2[key]
  return out

Then you can map/reduce like this (in Python 3, reduce must be imported from functools):

from functools import reduce

reduce(countReduce, map(wordToDict, words1 + words2))

>>> {'a': 3, 'foobar': 1, 'b': 2, 'bar': 6, 'foo': 5}

But you can also do it like this (which is what parallelization would do):

reduce(countReduce, [reduce(countReduce, map(wordToDict, words1)),
                     reduce(countReduce, map(wordToDict, words2))])

>>> {'a': 3, 'foobar': 1, 'b': 2, 'foo': 5, 'bar': 6}
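For comparison, the same word count can be done with Python's standard library collections.Counter, which is the idiomatic single-machine way to do it (this is my addition, not part of the original answer). Counter supports addition, so partial counts can be merged exactly like the parallel reduce above:

```python
from collections import Counter

words1 = ['a', 'foo', 'bar', 'foobar', 'foo', 'a', 'bar', 'bar', 'bar', 'bar', 'foo']
words2 = ['b', 'foo', 'foo', 'b', 'a', 'bar']

# Counter + Counter merges counts, playing the role of countReduce
counts = Counter(words1) + Counter(words2)
print(counts['bar'])  # 6
```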
Aaron Dufour answered Nov 15 '22

Actually these concepts are somewhat different, and the common names are misleading.

In functional programming (where Python borrowed these functions):

  • map applies some function to all elements of a list and returns a new list
  • reduce applies some function to aggregate all values in a list into a single value

In distributed computations MapReduce:

  • we are always working with key-value pairs (well, just pairs)
  • mapper takes a list of pairs and produces another list of pairs ("key" of input loses its semantics in this context)
  • reducer gets a key and the list of values corresponding to that key (from the mapper output) and produces some list of keys and values. The reducer input / mapper output is the one place where a "key" really acts as a key: values are grouped by key before being passed to the reducer
  • you may also have partitioner and combiner here :)
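To make the difference concrete, here is a minimal single-process sketch of that flow; the names mapper, shuffle, and reducer follow the usual MapReduce vocabulary, but the code itself is mine, not taken from any framework:

```python
from itertools import groupby
from operator import itemgetter

def mapper(pair):
    # the input key is ignored; emit (word, 1) for each word in the value
    _, line = pair
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # receives one key together with all of its grouped values
    yield (key, sum(values))

def map_reduce(pairs):
    # map phase: each input pair may produce any number of output pairs
    mapped = [kv for pair in pairs for kv in mapper(pair)]
    # shuffle phase: group values by key
    mapped.sort(key=itemgetter(0))
    # reduce phase: one reducer call per distinct key
    out = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        out.extend(reducer(key, (v for _, v in group)))
    return dict(out)

print(map_reduce([(0, "foo bar foo"), (1, "bar baz")]))
# {'bar': 2, 'baz': 1, 'foo': 2}
```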

Note that the mapper does not always produce one output pair per input pair, nor does the reducer always reduce each (key, list of values) to exactly one output pair. Mapper and reducer can output whatever they want. For example, a mapper can be used to filter pairs: it produces an output pair for some input pairs and ignores the others. It is also not uncommon to yield more than one pair per mapper/reducer input pair (or for some of them).
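As a small illustration of that flexibility (a hypothetical example, not from the answer), a mapper can drop some inputs entirely and emit several pairs for others:

```python
def mapper(pair):
    key, word = pair
    if word.startswith('#'):
        return            # filter: emit nothing for this input pair
    yield (word, 1)       # one pair for the word itself
    yield (word[0], 1)    # and a second pair keyed by its first letter

pairs = [(0, 'foo'), (1, '#skip'), (2, 'bar')]
mapped = [kv for p in pairs for kv in mapper(p)]
print(mapped)  # [('foo', 1), ('f', 1), ('bar', 1), ('b', 1)]
```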

But in most cases a MapReduce job works in much the same way as reduce(reduce_function, map(map_function, list)): the mapper usually does some computation for each input, and the reducer usually aggregates a list of values in some way. Any map_function and reduce_function can be expressed in MapReduce, but not vice versa.
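One way to see that embedding (a sketch of my own, assuming reduce_function is associative): have the mapper emit every mapped value under a single key, so the one reducer sees the whole list and can fold it with reduce_function:

```python
from functools import reduce

def map_function(x):
    return x * x

def reduce_function(a, b):
    return a + b

data = [1, 2, 3, 4]

# functional style
functional = reduce(reduce_function, map(map_function, data))

# "MapReduce" style: the mapper emits everything under one key,
# so a single reducer folds all the values for that key
mapped = [(None, map_function(x)) for x in data]
values = [v for _, v in mapped]
mapreduce = reduce(reduce_function, values)

print(functional, mapreduce)  # 30 30
```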

Alexey Tigarev answered Nov 16 '22