 

Mongodb map reduce vs Apache Spark map reduce

I have a use case in which I have 3M records in my MongoDB.

I want to aggregate data based on some condition.

I found two ways to accomplish it:

  • Using MongoDB's native map-reduce query
  • Using Apache Spark's map-reduce functions by connecting MongoDB to Spark.

I successfully executed my use case using both of the above methods and found that they performed similarly.

My question is:

Do MongoDB and Apache Spark use the same map-reduce algorithm, and which method (map-reduce via Spark or MongoDB's native map-reduce) is more efficient?

asked Mar 23 '26 by Prakash P

1 Answer

Do MongoDB and Apache Spark use the same map-reduce algorithm, and which method (map-reduce via Spark or MongoDB's native map-reduce) is more efficient?

In the broad sense of the map-reduce paradigm, yes, although the implementations differ (JavaScript in MongoDB vs. JVM code in Spark).

If your question is more about determining which of the two suits your use case, you should consider other aspects, especially since you have found both to be similar in performance. Let's explore below:

Assuming that you have the resources (time, money, servers) and expertise to maintain an Apache Spark cluster alongside a MongoDB cluster, then having a separate processing framework (Spark) and data store (MongoDB) is ideal: MongoDB servers keep their CPU/RAM for database querying, while Spark nodes keep theirs for intensive ETL. Afterward, write the result of the processing back into MongoDB.

If you are using the MongoDB Connector for Apache Spark, you can take advantage of the Aggregation Pipeline and (secondary) indexes to ETL only the range of data Spark needs, as opposed to pulling unnecessary data into Spark nodes, which means more processing overhead, greater hardware requirements, and more network latency.
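As a sketch of that pushdown idea: you define an aggregation pipeline that filters and projects on the MongoDB side before Spark ever sees the documents. The connection details, field names (`status`, `user`, `amount`), and option names below are placeholders/assumptions modelled on the connector's documented read options, not a definitive setup:

```python
import json

# Filter + project on the MongoDB side so only matching, trimmed
# documents are shipped to Spark. Field names are hypothetical.
pipeline = [
    {"$match": {"status": "active"}},
    {"$project": {"_id": 0, "user": 1, "amount": 1}},
]
pipeline_json = json.dumps(pipeline)

# With the MongoDB Spark Connector, a read using this pipeline would
# look roughly like the following (requires a running Spark + MongoDB
# setup, so it is commented out here):
#
# df = (spark.read.format("mongodb")
#       .option("connection.uri", "mongodb://localhost:27017")
#       .option("database", "mydb")           # placeholder
#       .option("collection", "records")      # placeholder
#       .option("aggregation.pipeline", pipeline_json)
#       .load())
```

The point is that the `$match`/`$project` stages run inside MongoDB (and can use secondary indexes), so Spark's hardware is spent on the actual transformation rather than on filtering raw data.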

You may find the following resources useful:

  • MongoDB Connector for Spark: Getting started - contains an example of aggregation.
  • MongoDB Spark Connector Java API
  • M233: Getting started with Spark and MongoDB - free online course

If you don't have the resources and expertise to maintain a Spark cluster, then keep it in MongoDB. It is worth mentioning that for most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface than MongoDB's map-reduce. If you can convert your map-reduce into an aggregation pipeline, I would recommend doing so. Also see Aggregation Pipeline Optimisation for extra optimisation tips.
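As a hedged illustration of such a conversion (the `category` and `amount` field names are hypothetical, not from your data): a map-reduce that emits a value per key and sums the values in its reduce step collapses into a single `$group` stage:

```python
# Hypothetical map-reduce being replaced:
#   map:    emit(this.category, this.amount)
#   reduce: Array.sum(values)
#
# Equivalent aggregation pipeline (pymongo-style dicts):
pipeline = [
    {"$group": {"_id": "$category", "total": {"$sum": "$amount"}}},
]

# Pure-Python model of what the $group stage computes, for illustration:
def group_sum(docs):
    totals = {}
    for d in docs:
        totals[d["category"]] = totals.get(d["category"], 0) + d["amount"]
    return totals

docs = [{"category": "a", "amount": 2},
        {"category": "b", "amount": 3},
        {"category": "a", "amount": 5}]
print(group_sum(docs))  # {'a': 7, 'b': 3}
```

The pipeline version stays entirely inside the server's native aggregation engine rather than running JavaScript per document, which is where the performance advantage over map-reduce typically comes from.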

If your use case doesn't require real-time processing, you can configure a delayed or hidden node in a MongoDB Replica Set, which will serve as a dedicated server/instance for your aggregation/map-reduce processing, separating the processing node(s) from the data-storage node(s). See also Replica Set Architectures.
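A minimal sketch of such a member entry, shown here as a Python dict mirroring the document you would pass to `rs.reconfig()` in mongosh (the host name is a placeholder; the member `_id` depends on your existing config):

```python
# Replica-set member reserved for analytics/map-reduce work:
# priority 0 means it can never be elected primary, and hidden: true
# means normal client reads are never routed to it.
analytics_member = {
    "_id": 3,                                  # placeholder member id
    "host": "analytics.example.com:27017",     # placeholder host
    "priority": 0,    # never becomes primary
    "hidden": True,   # invisible to ordinary client traffic
    # "secondaryDelaySecs": 3600,  # optionally also keep it delayed
}
```

You would append this member to the current replica-set configuration and apply it with `rs.reconfig()` on the primary; aggregation jobs then connect directly to that host.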

answered Mar 25 '26 by Wan Bachtiar


