 

When to use map reduce over Aggregation Pipeline in MongoDB?

While looking at documentation for map-reduce, I found that:

NOTE:

For most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface. However, map-reduce operations provide some flexibility that is not presently available in the aggregation pipeline.

I did not understand much from it.

  • What are the use cases for using map-reduce over aggregation pipeline?
  • What flexibility does map-reduce provide?
  • How large is the performance difference?
Dev asked May 22 '15 08:05


People also ask

What is the difference between map-reduce function and aggregate function?

Map-reduce is a common pattern when working with Big Data – it's a way to extract info from a huge dataset. But starting with version 2.2, MongoDB includes a new feature called the Aggregation Framework. Functionally, aggregation is equivalent to map-reduce but, on paper, it promises to be much faster.

What is the use of map-reduce in MongoDB?

In MongoDB, map-reduce is a data processing programming model that helps to perform operations on large data sets and produce aggregated results. MongoDB provides the mapReduce() function to perform map-reduce operations. It takes two main functions: a map function and a reduce function.
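To illustrate the model without a running server, here is a minimal in-memory sketch of the map/reduce contract in plain JavaScript. The collection and field names (`orders`, `cust`, `price`) are invented; in real MongoDB, `emit` is a global available inside the map function rather than an argument, and you would pass the two functions to `db.orders.mapReduce(...)`.

```javascript
// In-memory sketch of MongoDB's map-reduce model (not the real driver API):
// map() calls emit(key, value) for each document; reduce() folds all values
// emitted under the same key into a single result.
function mapReduce(docs, map, reduce) {
  const emitted = new Map();
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  // As in MongoDB, `this` inside map is the current document.
  docs.forEach((doc) => map.call(doc, emit));
  const out = {};
  for (const [key, values] of emitted) out[key] = reduce(key, values);
  return out;
}

// Hypothetical orders collection: total order value per customer.
const orders = [
  { cust: "ann", price: 10 },
  { cust: "bob", price: 5 },
  { cust: "ann", price: 7 },
];

const totals = mapReduce(
  orders,
  function (emit) { emit(this.cust, this.price); },   // map
  (key, values) => values.reduce((a, b) => a + b, 0)  // reduce
);
// totals is { ann: 17, bob: 5 }
```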

Can mapReduce be used for aggregation?

Map-reduce operations can be rewritten using aggregation pipeline operators, such as $group , $merge , and others. For map-reduce operations that require custom functionality, MongoDB provides the $accumulator and $function aggregation operators starting in version 4.4.
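As a sketch of such a rewrite, the pipeline below expresses a sum-per-key job as a single $group stage. To keep it checkable without a server, a naive in-memory evaluator for just the $group/$sum subset is included; in a real deployment you would simply pass `pipeline` to `db.orders.aggregate(pipeline)`. Collection and field names are invented.

```javascript
// The $group rewrite of a "sum price per customer" map-reduce job.
const pipeline = [
  { $group: { _id: "$cust", total: { $sum: "$price" } } },
];

// Naive evaluator for a single $group stage with $sum accumulators only --
// enough to demonstrate the equivalence in memory.
function runGroupStage(docs, stage) {
  const { _id: keyExpr, ...accumulators } = stage.$group;
  const keyField = keyExpr.slice(1); // "$cust" -> "cust"
  const groups = new Map();
  for (const doc of docs) {
    const key = doc[keyField];
    if (!groups.has(key)) groups.set(key, { _id: key });
    const group = groups.get(key);
    for (const [name, spec] of Object.entries(accumulators)) {
      const field = spec.$sum.slice(1); // "$price" -> "price"
      group[name] = (group[name] ?? 0) + doc[field];
    }
  }
  return [...groups.values()];
}

const orders = [
  { cust: "ann", price: 10 },
  { cust: "bob", price: 5 },
  { cust: "ann", price: 7 },
];

const results = runGroupStage(orders, pipeline[0]);
// results: [{ _id: "ann", total: 17 }, { _id: "bob", total: 5 }]
```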

Which aggregation method is preferred for use by MongoDB?

The pipeline provides efficient data aggregation using native operations within MongoDB and is the preferred method for data aggregation in MongoDB. The aggregation pipeline can operate on a sharded collection.


1 Answer

For one thing, Map/Reduce in MongoDB wasn't made for ad-hoc queries; there's considerable overhead to M/R. Even a very simple M/R operation on a small dataset can take hundreds of milliseconds because of that overhead.

I can't say much about the performance of M/R compared to the aggregation framework on large datasets in practice, but in theory, M/R operations on a large sharded database should be faster since the shards can run the operations largely in parallel.

As for flexibility, since M/R actually runs JavaScript functions, you have the full power of the language at your disposal. For example, suppose you wanted to group some data by the cosine of a field's value. Since there was neither a $cos operator in the aggregation framework nor a meaningful way to build discrete buckets from continuous numbers (something like $truncate), the aggregation framework couldn't help in that case.
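To make the cosine example concrete, here is a plain-JavaScript sketch of the kind of bucketing logic a map function can run. The field name `angle` and the one-decimal bucket width are invented for illustration; in real map-reduce this logic would live inside the `map` function and call `emit(key, 1)` per document.

```javascript
// Group documents by the cosine of a field, truncated to one decimal place
// to form discrete buckets -- arbitrary JavaScript that a map-reduce `map`
// can run, but which the aggregation framework of the time could not express.
function countByCosBucket(docs) {
  const buckets = {};
  for (const doc of docs) {
    // Equivalent of emit(key, 1): key is cos(angle) floored to 1 decimal.
    const key = (Math.floor(Math.cos(doc.angle) * 10) / 10).toFixed(1);
    buckets[key] = (buckets[key] || 0) + 1;
  }
  return buckets;
}

const docs = [{ angle: 0 }, { angle: Math.PI }, { angle: Math.PI / 4 }];
const counts = countByCosBucket(docs);
// cos(0) = 1, cos(pi) = -1, cos(pi/4) ~ 0.707
// -> one document each in buckets "1.0", "-1.0", "0.7"
```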

So, in a nutshell, I'd say the use cases are:

  • keeping the results of M/R in a separate collection and updating it from time to time (using the out parameter and merging the results)
  • complex queries on large sharded data sets
  • queries so complex that you can't use the aggregation framework. I'd say that's a pretty certain sign of a design flaw in the data structure, but in principle it can help
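The first use case relies on M/R's ability to re-reduce: with `out: { reduce: "coll" }`, a stored result for a key is fed back through the reduce function together with newly emitted values, which works because reduce folds a list of partial results. A hedged in-memory sketch of that incremental pattern, with invented field names:

```javascript
// Sketch of the `out: { reduce: ... }` incremental pattern: the previously
// stored total for a key is re-reduced together with values from newly
// arrived documents, instead of recomputing everything from scratch.
const reduce = (key, values) => values.reduce((a, b) => a + b, 0);

// Result of an earlier run, as it might sit in the output collection.
const stored = { ann: 17, bob: 5 };

// New documents that arrived since the last run.
const fresh = [
  { cust: "ann", price: 3 },
  { cust: "carl", price: 2 },
];

for (const doc of fresh) {
  const prev = doc.cust in stored ? [stored[doc.cust]] : [];
  stored[doc.cust] = reduce(doc.cust, [...prev, doc.price]);
}
// stored: { ann: 20, bob: 5, carl: 2 }
```

Note this only works when reduce is associative and its output has the same shape as its input values, which is exactly what MongoDB requires of reduce functions.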
mnemosyn answered Oct 24 '22 03:10