
Incremental MapReduce implementations (other than CouchDB, preferably)

I work on a project that sits on a large-ish pile of raw data, aggregates from which are used to power a public-facing informational site (some simple aggregates like various totals and top-tens of totals, and some somewhat-more-complicated aggregates). At present we update it once every few months, which involves adding new data, and possibly updating or deleting existing records, and re-running all the aggregation off-line, after which new aggregates are deployed to production.

We're interested in ramping up the frequency of updates, such that re-aggregating everything from scratch isn't practical, so we'd like to do rolling aggregation that updates the existing aggregates to reflect new, changed, or deleted records.

CouchDB's MapReduce implementation offers roughly the facility that I'm looking for: it stores the intermediate state of MapReduce tasks in a big B-tree, where the output of maps is at the leaves, and reduce operations gradually join branches together. New, updated, or deleted records cause subtrees to be marked as dirty and recomputed, but only the relevant portions of the reduce tree need to be touched, and intermediate results from non-dirty subtrees can be re-used as is.
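The dirty-subtree scheme described above can be sketched as a segment tree: leaves hold mapped values, internal nodes cache reduced results, and changing one record re-reduces only the O(log n) ancestors of that leaf, reusing every other cached subtree. This is a hypothetical illustration, not CouchDB's actual code; for simplicity it assumes a power-of-two number of records and an in-memory array rather than an on-disk B-tree.

```python
# Hypothetical sketch of incremental reduce (not CouchDB's implementation).
# Leaves = map output; internal nodes cache reduced values.
# Assumes len(values) is a power of two.

class ReduceTree:
    def __init__(self, values, reduce_fn):
        self.reduce_fn = reduce_fn
        self.n = len(values)
        self.nodes = [None] * (2 * self.n)
        self.nodes[self.n:] = values                 # leaves = mapped records
        for i in range(self.n - 1, 0, -1):           # build cached reductions
            self.nodes[i] = reduce_fn(self.nodes[2 * i], self.nodes[2 * i + 1])

    def update(self, index, value):
        """Change one record; re-reduce only the dirty path to the root."""
        i = self.n + index
        self.nodes[i] = value
        while i > 1:
            i //= 2                                  # walk up to the parent
            self.nodes[i] = self.reduce_fn(self.nodes[2 * i], self.nodes[2 * i + 1])

    def total(self):
        return self.nodes[1]                         # root holds the full reduce

tree = ReduceTree([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a + b)
print(tree.total())   # 31
tree.update(2, 10)    # one record changed: only 3 internal nodes recomputed
print(tree.total())   # 37
```

Deletes can be modeled the same way by updating the leaf to the reduce function's identity element (0 for a sum); appends would grow the tree, which a B-tree handles naturally.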

For a variety of reasons, though (uncertainty about CouchDB's future, lack of convenient support for non-MR one-off queries, current SQL-heavy implementation, etc.), we'd prefer not to use CouchDB for this project, so I'm looking for other implementations of this kind of tree-ish incremental map-reduce strategy (possibly, but not necessarily, atop Hadoop or similar).

To pre-empt some possible responses:

  • I'm aware of MongoDB's supposed support for incremental MapReduce; it's not the real thing, in my opinion, because it really only works well for additions to the dataset, not updates or deletes.
  • I'm also aware of the Incoop paper. This describes exactly what I want, but I don't think they've made their implementation public.
Andrew Pendleton asked Jun 18 '13 21:06


1 Answer

The first thing that comes to mind for me is still Hive, which now has features such as materialized views; these could hold your aggregates and be selectively invalidated when the underlying data changes.

Though it is not brand new, Uber published a fairly timeless article on how they approached the challenge of incremental updates, and even the solutions they only reference in passing may be interesting for this use case. The key takeaways of the article:

  1. Getting really specific about your actual latency needs can save you tons of money.
  2. Hadoop can solve a lot more problems by employing primitives to support incremental processing.
  3. Unified architectures (code + infrastructure) are the way of the future.

Full disclosure: I am an employee of Cloudera, a provider of Big Data platforms including Hadoop, which contains various tools referenced in the article, as well as Hive, which I referred to directly.

Dennis Jaheruddin answered Nov 07 '22 02:11