 

State of Map-Reduce on Appengine?

There is appengine-mapreduce, which seems to be the official way to do things on App Engine. But there seems to be no documentation besides some hacked-together wiki pages and lengthy videos. There are statements that the library only supports the map step, but the source indicates that there are also implementations for shuffle.

A version of this appengine-mapreduce library also seems to be included in the SDK, but it is not blessed for public use. So you are basically expected to load the library twice into your runtime.

Then there is appengine-pipeline: "A primary use-case of the API is connecting together various App Engine MapReduces into a computational pipeline." But there also seems to be pipeline-related code in the appengine-mapreduce library.

So where do I start to find out how this all fits together? Which is the library to call from my project? Is there any decent documentation on appengine-mapreduce besides parsing change logs?

max asked Dec 07 '11 07:12



1 Answer

Which is the library to call from my project?

They serve different purposes, and you've provided no details about what you're attempting to do.

The most fundamental layer here is the task queue, which lets you schedule background work that can be highly parallelized. This is fan-out. Let's say you had a list of 1000 websites, and you wanted to check the response time for each one and send an email for any site that takes more than 5 seconds to load. By running these as concurrent tasks, you can complete the work much faster than if you checked all 1000 sites in sequence.
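
To make the fan-out idea concrete, here is a minimal local sketch. It does not use the App Engine task queue at all: a thread pool from the standard library stands in for it, and `measure` and `notify` are hypothetical callables you would supply (one times a site, the other sends the email). The point is only the shape: one independent job per site, each deciding on its own whether to alert.

```python
from concurrent.futures import ThreadPoolExecutor

SLOW_THRESHOLD_SECONDS = 5.0  # the 5-second limit from the example above


def fan_out(urls, measure, notify, max_workers=20):
    """Run one independent job per URL; each job alerts on its own.

    `measure(url)` returns a response time in seconds and
    `notify(url, seconds)` sends the alert. Both are hypothetical
    callables passed in by the caller, not App Engine APIs.
    """
    def job(url):
        seconds = measure(url)
        if seconds > SLOW_THRESHOLD_SECONDS:
            notify(url, seconds)

    # The thread pool is a local stand-in for the task queue.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(job, urls))  # drain the iterator so errors surface
```

No job needs to know about any other job, which is why this case is easy to parallelize.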

Now let's say you don't want to send an email for every slow site; you just want to check all 1000 sites and send one summary email that says how many took more than 5 seconds and how many took fewer. This is fan-in. It's trickier with the task queue, because you need to know when all tasks have completed, and you need to collect and summarize their results.
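
A local sketch of fan-in, again with a thread pool standing in for the task queue and a hypothetical `measure` callable. Here the hard part is hidden by the pool: `pool.map` blocks until every job has finished and hands back all the results, which is exactly the bookkeeping you would have to build yourself on top of raw tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(urls, measure, threshold=5.0, max_workers=20):
    """Check every URL in parallel, then produce a single summary.

    `measure(url)` is a hypothetical callable returning seconds.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Fan-out: one job per URL. Fan-in: list() waits for all of them.
        timings = list(pool.map(measure, urls))

    slow = sum(1 for seconds in timings if seconds > threshold)
    return {"slow": slow, "fast": len(timings) - slow}
```

The returned dictionary is what would go into the one summary email.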

Enter the Pipeline API. The Pipeline API abstracts the task queue to make fan-in easier. You write what looks like synchronous, procedural code, but it uses Python futures and is executed (as much as possible) in parallel. The Pipeline API keeps track of task dependencies and collects results to facilitate building distributed workflows.
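
To give a feel for "looks synchronous, but works with futures", here is a toy in-process imitation. It is not the Pipeline API: `Slot`, `Stage` and `run_workflow` are names invented for this sketch, and the runner executes stages one after another instead of in parallel. What it does show is the style: the workflow yields a stage, gets back a placeholder for its result, and passes placeholders to later stages without ever touching the values itself.

```python
class Slot:
    """Placeholder for a result that does not exist yet (a toy future)."""
    value = None


class Stage:
    """A unit of work plus the slot its result will be written to."""
    def __init__(self, func, *args):
        self.func = func
        self.args = args
        self.output = Slot()


def run_workflow(workflow, *args):
    """Collect the stages a generator yields, then run them in order."""
    gen = workflow(*args)
    stages = []
    try:
        stage = next(gen)
        while True:
            stages.append(stage)
            stage = gen.send(stage.output)  # hand back a placeholder
    except StopIteration:
        pass
    for stage in stages:
        # Replace placeholders with real values just before running.
        resolved = [a.value if isinstance(a, Slot) else a for a in stage.args]
        stage.output.value = stage.func(*resolved)
    return stages[-1].output.value if stages else None


def count_slow(*timings):
    slow = sum(1 for seconds in timings if seconds > 5.0)
    return {"slow": slow, "fast": len(timings) - slow}


def check_all(urls, measure):
    """Reads like sequential code, but only ever handles placeholders."""
    timings = []
    for url in urls:
        timings.append((yield Stage(measure, url)))
    yield Stage(count_slow, *timings)
```

Because the runner sees the whole dependency graph before anything executes, a real implementation is free to run the independent site checks concurrently and hold back the summary stage until all of them are done.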

The MapReduce API wraps the Pipeline API to facilitate a specific type of distributed workflow: mapping the results of a piece of work into a set of key/value pairs, and reducing multiple sets of results to one by combining their values.
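
The map and reduce steps themselves can be sketched in a few lines of plain Python. This is a single-process illustration of the pattern, not the appengine-mapreduce API; `map_reduce`, `speed_bucket` and `count` are names made up for the example, and the input is assumed to be a list of `(url, seconds)` pairs.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Map each record to key/value pairs, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map step
            groups[key].append(value)       # grouping by key (the "shuffle")
    return {key: reducer(key, values) for key, values in groups.items()}


def speed_bucket(record):
    """Mapper: emit ("slow", 1) or ("fast", 1) for one (url, seconds) pair."""
    _url, seconds = record
    yield ("slow" if seconds > 5.0 else "fast"), 1


def count(_key, values):
    """Reducer: combine all values for one key by summing them."""
    return sum(values)
```

Calling `map_reduce(timings, speed_bucket, count)` on a list of measured sites gives the same slow/fast totals as the summary email example, just expressed as a map step and a reduce step.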

So they provide increasing layers of abstraction and convenience around a common system of distributed task execution. The right solution depends on what you're trying to accomplish.

Drew Sears answered Oct 22 '22 03:10
