Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is MapReduce in CouchDB called "incremental"?

I am reading the O'Reilly CouchDB book. I am puzzled by the reduce/re-reduce/incremental-MapReduce part on page 64. Too much is left to rhetory in the O'Reilly book with the sentence

If you're interested in pushing the ede of CouchDB's incremental reduce functionality, have a look at Google's paper on Sawzall, ...

If I understand the word "incremental" correctly, it refers to some sort of addition -operation in the B-tree data structure. I cannot yet see why it is somehow special over typical map-reduce, probably not yet understanding it. In CouchDB, it mentions that there is no side-effects with map function - does that hold true with reduce too?

Why is MapReduce in CouchDB is called "incremental"?

Helper questions

  1. Explain the quote about incremental MapReduce with Sawzall.
  2. Why two terms for the same thing i.e. reduction? Reduce and re-reduce?

References

  1. A Google paper about Sawzall.
  2. Introduction to CouchDB views in the CouchDB wiki and a lot of blurry blog references.
  3. CouchDB O'Reilly book
like image 787
hhh Avatar asked Jun 28 '12 00:06

hhh


People also ask

What is incremental MapReduce?

The incremental MapReduce framework can be applied to several fields such as web crawls, PageRanking, life science computing, graph processing, text processing, machine learning, data mining, and relational data processing.

What is MapReduce in CouchDB?

In order to retrieve data with CouchDB, we use a process called MapReduce, to create views. A view contains rows of data that is sorted by the row's key (you might use date as a key, for example, to sort your data based on the date). MapReduce is a combination of two concepts Map and Reduce.


1 Answers

This page that you linked explained it.

The view (which is the whole point of map reduce in CouchDB) can be updated by re-indexing only the documents that have changed since the last index update. That's the incremental part.

This can be achieved by requiring the reduce function to be referentially transparent, which means that it always returns the same output for a given input.

The reduce function also must be commutative and associative for the array value input, which means that if you run the reducer on the output of that same reducer, you will receive the same result. In that wiki page it is expressed like:

f(Key, Values) == f(Key, [ f(Key, Values) ] )

Rereduce is where you take the output from several reducer calls and run that through the reducer again. This sometimes is required because CouchDB sends stuff through the reducer in batches, so sometimes not all keys that need to be reduced will be sent through in once shot.

like image 197
nickgroenke Avatar answered Sep 29 '22 17:09

nickgroenke