 

Mongodb map reduce vs Apache Spark map reduce

I have a use case in which I have 3M records in my MongoDB.

I want to aggregate data based on some condition.

I found two ways to accomplish it:

  • Using MongoDB's native map-reduce query
  • Using Apache Spark's map-reduce functions by connecting MongoDB to Spark.

I successfully executed my use case using both of the above methods and found that they performed similarly.

My question is:

Do MongoDB and Apache Spark use the same map-reduce algorithm, and which method (map-reduce via Spark or MongoDB's native map-reduce) is more efficient?

asked Mar 23 '26 by Prakash P

1 Answer

Do MongoDB and Apache Spark use the same map-reduce algorithm, and which method (map-reduce via Spark or MongoDB's native map-reduce) is more efficient?

In the broad sense of the map-reduce paradigm, yes, although the implementations differ (JavaScript in MongoDB vs. JVM code in Spark).

If your question is more about determining which of the two suits your use case, you should consider other aspects, especially since you have found both to be similar in performance. Let's explore below:

Assuming that you have the resources (time, money, servers) and expertise to maintain an Apache Spark cluster alongside a MongoDB cluster, then having a separate processing framework (Spark) and data store (MongoDB) is ideal: MongoDB servers keep their CPU/RAM for database querying, while Spark nodes keep theirs for intensive ETL. Afterward, write the result of the processing back into MongoDB.

If you are using the MongoDB Connector for Apache Spark, you can take advantage of the Aggregation Pipeline and (secondary) indexes to ETL only the range of data Spark needs, as opposed to pulling unnecessary data into Spark nodes, which means more processing overhead, greater hardware requirements, and more network latency.
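As a sketch of that pushdown idea: you define an aggregation pipeline that filters and projects on the MongoDB side before Spark ever sees the documents. The connection details, field names (`status`, `user`, `amount`), and option names below are placeholders/assumptions modelled on the connector's documented read options, not a definitive setup:

```python
import json

# Filter + project on the MongoDB side so only matching, trimmed
# documents are shipped to Spark. Field names are hypothetical.
pipeline = [
    {"$match": {"status": "active"}},
    {"$project": {"_id": 0, "user": 1, "amount": 1}},
]
pipeline_json = json.dumps(pipeline)

# With the MongoDB Spark Connector, a read using this pipeline would
# look roughly like the following (requires a running Spark + MongoDB
# setup, so it is commented out here):
#
# df = (spark.read.format("mongodb")
#       .option("connection.uri", "mongodb://localhost:27017")
#       .option("database", "mydb")           # placeholder
#       .option("collection", "records")      # placeholder
#       .option("aggregation.pipeline", pipeline_json)
#       .load())
```

The point is that the `$match`/`$project` stages run inside MongoDB (and can use secondary indexes), so Spark's hardware is spent on the actual transformation rather than on filtering raw data.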

You may find the following resources useful:

  • MongoDB Connector for Spark: Getting started - contains an example of aggregation.
  • MongoDB Spark Connector Java API
  • M233: Getting started with Spark and MongoDB - free online course

If you don't have the resources and expertise to maintain a Spark cluster, then keep it in MongoDB. It is worth mentioning that for most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface than MongoDB's map-reduce. If you can convert your map-reduce into an aggregation pipeline, I would recommend doing so. Also see Aggregation Pipeline Optimisation for extra optimisation tips.
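As a hedged illustration of such a conversion (the `category` and `amount` field names are hypothetical, not from your data): a map-reduce that emits a value per key and sums the values in its reduce step collapses into a single `$group` stage:

```python
# Hypothetical map-reduce being replaced:
#   map:    emit(this.category, this.amount)
#   reduce: Array.sum(values)
#
# Equivalent aggregation pipeline (pymongo-style dicts):
pipeline = [
    {"$group": {"_id": "$category", "total": {"$sum": "$amount"}}},
]

# Pure-Python model of what the $group stage computes, for illustration:
def group_sum(docs):
    totals = {}
    for d in docs:
        totals[d["category"]] = totals.get(d["category"], 0) + d["amount"]
    return totals

docs = [{"category": "a", "amount": 2},
        {"category": "b", "amount": 3},
        {"category": "a", "amount": 5}]
print(group_sum(docs))  # {'a': 7, 'b': 3}
```

The pipeline version stays entirely inside the server's native aggregation engine rather than running JavaScript per document, which is where the performance advantage over map-reduce typically comes from.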

If your use case doesn't require real-time processing, you can configure a delayed or hidden node in a MongoDB Replica Set, which will serve as a dedicated server/instance for your aggregation/map-reduce processing, separating the processing node(s) from the data-storage node(s). See also Replica Set Architectures.
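A minimal sketch of such a member entry, shown here as a Python dict mirroring the document you would pass to `rs.reconfig()` in mongosh (the host name is a placeholder; the member `_id` depends on your existing config):

```python
# Replica-set member reserved for analytics/map-reduce work:
# priority 0 means it can never be elected primary, and hidden: true
# means normal client reads are never routed to it.
analytics_member = {
    "_id": 3,                                  # placeholder member id
    "host": "analytics.example.com:27017",     # placeholder host
    "priority": 0,    # never becomes primary
    "hidden": True,   # invisible to ordinary client traffic
    # "secondaryDelaySecs": 3600,  # optionally also keep it delayed
}
```

You would append this member to the current replica-set configuration and apply it with `rs.reconfig()` on the primary; aggregation jobs then connect directly to that host.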

answered Mar 25 '26 by Wan Bachtiar


