Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is Map/Reduce?

I hear a lot about map/reduce, especially in the context of Google's massively parallel compute system. What exactly is it?

like image 330
Lawrence Dol Avatar asked Dec 23 '08 06:12

Lawrence Dol


People also ask

What is meant by MapReduce?

What is MapReduce? MapReduce is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop. The term "MapReduce" refers to two separate and distinct tasks that Hadoop programs perform.

What is MapReduce and how it works?

MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from multiple servers to return a consolidated output back to the application.

What is MapReduce in big data?

MapReduce is a programming model for processing large data sets with a parallel , distributed algorithm on a cluster (source: Wikipedia). Map Reduce when coupled with HDFS can be used to handle big data.

What is MapReduce and its types?

Mapping is the core technique of processing a list of data elements that come in pairs of keys and values. The map function applies to individual elements defined as key-value pairs of a list and produces a new list.


2 Answers

From the abstract of Google's MapReduce research publication page:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that the processing can be performed in parallel on multiple processing nodes (multiple servers) so it is a system that can scale very well.

Since it's based from the functional programming model, the map and reduce steps each do not have any side-effects (the state and results from each subsection of a map process does not depend on another), so the data set being mapped and reduced can each be separated over multiple processing nodes.

Joel's Can Your Programming Language Do This? piece discusses how understanding functional programming was essential in Google to come up with MapReduce, which powers its search engine. It's a very good read if you're unfamiliar with functional programming and how it allows scalable code.

See also: Wikipedia: MapReduce

Related question: Please explain mapreduce simply

like image 129
coobird Avatar answered Oct 05 '22 04:10

coobird


Map is a function that applies another function to all the items on a list, to produce another list with all the return values on it. (Another way of saying "apply f to x" is "call f, passing it x". So sometimes it sounds nicer to say "apply" instead of "call".)

This is how map is probably written in C# (it's called Select and is in the standard library):

public static IEnumerable<R> Select<T, R>(this IEnumerable<T> list, Func<T, R> func) {     foreach (T item in list)         yield return func(item); } 

As you're a Java dude, and Joel Spolsky likes to tell GROSSLY UNFAIR LIES about how crappy Java is (actually, he's not lying, it is crappy, but I'm trying to win you over), here's my very rough attempt at a Java version (I have no Java compiler, and I vaguely remember Java version 1.1!):

// represents a function that takes one arg and returns a result public interface IFunctor {     object invoke(object arg); }  public static object[] map(object[] list, IFunctor func) {     object[] returnValues = new object[list.length];      for (int n = 0; n < list.length; n++)         returnValues[n] = func.invoke(list[n]);      return returnValues; } 

I'm sure this can be improved in a million ways. But it's the basic idea.

Reduce is a function that turns all the items on a list into a single value. To do this, it needs to be given another function func that turns two items into a single value. It would work by giving the first two items to func. Then the result of that along with the third item. Then the result of that with the fourth item, and so on until all the items have gone and we're left with one value.

In C# reduce is called Aggregate and is again in the standard library. I'll skip straight to a Java version:

// represents a function that takes two args and returns a result public interface IBinaryFunctor {     object invoke(object arg1, object arg2); }  public static object reduce(object[] list, IBinaryFunctor func) {     if (list.length == 0)         return null; // or throw something?      if (list.length == 1)         return list[0]; // just return the only item      object returnValue = func.invoke(list[0], list[1]);      for (int n = 1; n < list.length; n++)         returnValue = func.invoke(returnValue, list[n]);      return returnValue; } 

These Java versions need generics adding to them, but I don't know how to do that in Java. But you should be able to pass them anonymous inner classes to provide the functors:

string[] names = getLotsOfNames();  string commaSeparatedNames = (string)reduce(names,     new IBinaryFunctor {        public object invoke(object arg1, object arg2)            { return ((string)arg1) + ", " + ((string)arg2); }    } 

Hopefully generics would get rid of the casts. The typesafe equivalent in C# is:

string commaSeparatedNames = names.Aggregate((a, b) => a + ", " + b); 

Why is this "cool"? Simple ways of breaking up larger calculations into smaller pieces, so they can be put back together in different ways, are always cool. The way Google applies this idea is to parallelization, because both map and reduce can be shared out over several computers.

But the key requirement is NOT that your language can treat functions as values. Any OO language can do that. The actual requirement for parallelization is that the little func functions you pass to map and reduce must not use or update any state. They must return a value that is dependent only on the argument(s) passed to them. Otherwise, the results will be completely screwed up when you try to run the whole thing in parallel.

like image 36
Daniel Earwicker Avatar answered Oct 05 '22 04:10

Daniel Earwicker