 

Hadoop MapReduce vs MPI (vs Spark vs Mahout vs Mesos) - When to use one over the other?

I am new to parallel computing and just starting to try out MPI and Hadoop+MapReduce on Amazon AWS. But I am confused about when to use one over the other.

For example, one common rule of thumb I see can be summarized as:

  • Big data, non-iterative, fault tolerant => MapReduce (see the word-count sketch after this list)
  • Speed, small data, iterative, non-Mapper-Reducer type => MPI
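To make the first rule of thumb concrete, here is what the "Mapper-Reducer type" shape looks like as the classic word count written for Hadoop Streaming in Python. This is only an illustrative sketch; the file names and the streaming setup are assumptions, not part of the question.

```python
#!/usr/bin/env python3
# mapper.py -- reads raw text on stdin, emits one (word, 1) pair per line.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop sorts mapper output by key, so identical words arrive
# consecutively; sum the counts for each run of equal keys.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(n)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

Each map task runs independently on its own input split and each reduce task sees a disjoint key range, which is why this shape parallelizes and recovers from task failures so easily.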

But then I also see an implementation of MapReduce on top of MPI (MR-MPI) which does not provide fault tolerance, yet seems to be more efficient than Hadoop MapReduce on some benchmarks, and appears to handle big data through out-of-core processing.

Conversely, there are also MPI implementations (MPICH2-YARN) that run on the new-generation Hadoop YARN, with its distributed file system (HDFS).

Besides, there seem to be provisions within MPI (Scatter/Gather, checkpoint-restart, ULFM and other fault-tolerance efforts) that mimic several features of the MapReduce paradigm.
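For concreteness, here is a minimal mpi4py sketch of how Scatter plus a Reduce collective can mimic a single map/reduce step. Assumptions: mpi4py is installed and the script is launched with something like mpiexec -n 4 python scatter_reduce.py; the sum-of-squares workload is just a placeholder.

```python
# Mimicking one map/reduce step with MPI collectives (pickle-based mpi4py API).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Root partitions the input into one "split" per rank, like an input splitter.
chunks = None
if rank == 0:
    data = list(range(100))
    chunks = [data[i::size] for i in range(size)]

local = comm.scatter(chunks, root=0)              # distribute the splits ("map" input)
partial = sum(x * x for x in local)               # local "map" plus combiner
total = comm.reduce(partial, op=MPI.SUM, root=0)  # global "reduce" to the root

if rank == 0:
    print("sum of squares:", total)
```

What MPI does not give you out of the box here is the fault tolerance: if one rank dies mid-job, the whole job dies unless you add checkpoint-restart or ULFM-style recovery, whereas Hadoop simply reschedules the failed task.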

And how do Mahout, Mesos, and Spark fit into all this?

What criteria can be used when deciding between (or a combo of) Hadoop MapReduce, MPI, Mesos, Spark and Mahout?

asked Jan 06 '15 by GuSuku

People also ask

Is Spark always better than MapReduce?

Apache Spark is well known for its speed. It can run up to 100 times faster in memory and up to 10 times faster on disk than Hadoop MapReduce. The reason is that Apache Spark processes data in memory (RAM), while Hadoop MapReduce has to persist data back to disk after every Map or Reduce action.

In which case MapReduce is better than Spark?

Hadoop MapReduce is meant for data that does not fit in memory, whereas Apache Spark performs better for data that does fit in memory, particularly on dedicated clusters. Both are fault tolerant, but comparatively Hadoop MapReduce is more fault tolerant than Spark.

Does Spark replace MapReduce?

Apache Spark could replace Hadoop MapReduce, but Spark needs much more memory; MapReduce, by contrast, kills its processes after job completion, so it can easily run within limited memory by spilling to disk. Apache Spark performs best on iterative computations where cached data is used repeatedly.
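A hedged PySpark sketch of that iterative pattern: cache the working set once, then loop over it in memory. The HDFS path and the toy gradient-descent loop are placeholders, not part of the original answers.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()

# Parse once, then pin the RDD in memory so each iteration skips HDFS and re-parsing.
points = (spark.sparkContext
          .textFile("hdfs:///data/points.txt")   # placeholder path
          .map(float)
          .cache())

# Toy iterative job: gradient descent toward the mean of the data.
guess = 0.0
for _ in range(10):
    grad = points.map(lambda x: guess - x).mean()  # reads the cached copy
    guess -= 0.5 * grad

print("estimated mean:", guess)
spark.stop()
```

Under MapReduce, each of those ten iterations would be a separate job that rereads its input from disk, which is exactly the overhead described above.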

What other programming modules can be used to develop big data applications apart from MapReduce and the Hadoop framework?

Spark can run batch workloads as an alternative to MapReduce, and it also provides higher-level APIs for several other processing use cases.
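As an illustration of those higher-level APIs, here is a hedged sketch of a batch word count on Spark's DataFrame API instead of hand-written map and reduce functions (the input and output paths are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("batch-wordcount").getOrCreate()

counts = (spark.read.text("hdfs:///data/docs/*.txt")             # placeholder path
          .select(explode(split(col("value"), r"\s+")).alias("word"))
          .where(col("word") != "")                              # drop empty tokens
          .groupBy("word")
          .count())

counts.write.csv("hdfs:///out/wordcounts")                       # placeholder path
spark.stop()
```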


2 Answers

There might be good technical criteria for this decision, but I haven't seen anything published on them. There seems to be a cultural divide: it's understood that MapReduce gets used for sifting through data in corporate environments, while scientific workloads use MPI. That may be due to the underlying sensitivity of those workloads to network performance. Here are a few thoughts about how to find out which applies to you:

Many modern MPI implementations can run over multiple networks but are heavily optimized for InfiniBand. The canonical use case for MapReduce seems to be a cluster of "white box" commodity systems connected via Ethernet. A quick search on "MapReduce Infiniband" leads to http://dl.acm.org/citation.cfm?id=2511027, which suggests that the use of InfiniBand in a MapReduce environment is a relatively new thing.

So why would you want to run on a system that's highly optimized for InfiniBand? It's significantly more expensive than Ethernet, but it has higher bandwidth, lower latency, and scales better under high network contention (ref: http://www.hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf).

If your application would be sensitive to the effects those InfiniBand optimizations target, the tuning already baked into many MPI libraries may be useful to you. If your app is relatively insensitive to network performance and spends most of its time on computations that don't require communication between processes, MapReduce is probably the better choice.

If you have the opportunity to run benchmarks, you can make a projection on whatever system you have available to see how much improved network performance would help. Try throttling your network: downclock GigE to 100 Mbit, or InfiniBand QDR to DDR, for example, then draw a line through the results and see whether the purchase of a faster interconnect optimized by MPI would get you where you want to go.
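A back-of-the-envelope sketch of that projection in Python. All numbers below are hypothetical placeholders for the throttled measurements described above:

```python
import numpy as np

# Hypothetical runtimes of the same job measured at throttled link speeds.
bandwidth_gbps = np.array([0.1, 1.0, 10.0])    # 100 Mbit, GigE, 10GigE
runtime_sec    = np.array([5400.0, 1800.0, 1500.0])

# For communication-bound jobs, runtime tends to scale with 1/bandwidth,
# so fit runtime = a * (1/bandwidth) + b, where b is the compute-bound floor.
a, b = np.polyfit(1.0 / bandwidth_gbps, runtime_sec, 1)

for target in (40.0, 100.0):                   # e.g., QDR/EDR-class InfiniBand
    print(f"projected runtime at {target:.0f} Gbit/s: {a / target + b:.0f} s")
```

If the projected curve flattens out well above your target runtime, the job is compute-bound and a faster interconnect (and MPI) won't buy you much.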

answered Sep 28 '22 by Aaron Altman


The link you posted about FEM being done on MapReduce (http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6188175&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6188175) actually uses MPI. It states that right in the abstract: they combined MPI's programming model (non-embarrassingly parallel) with HDFS to "stage" the data and exploit data locality.

Hadoop is purely for embarrassingly parallel computations. Anything that requires processes to organize themselves and exchange data in complex ways will get crap performance with Hadoop. This can be demonstrated both from an algorithmic-complexity point of view and from a measurement point of view.
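To illustrate what "exchange data in complex ways" means, here is a minimal mpi4py sketch of a 1-D halo exchange, the communication backbone of FEM-style solvers: every rank swaps boundary values with its neighbors on every iteration, a pattern with no natural expression as independent map and reduce tasks. (Assumes mpi4py; run with something like mpiexec -n 4 python halo.py; the smoothing update is a toy placeholder.)

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size   # periodic neighbors

value = float(rank)
for _ in range(5):
    # Deadlock-free paired exchange with both neighbors each step.
    from_left = comm.sendrecv(value, dest=right, source=left)
    from_right = comm.sendrecv(value, dest=left, source=right)
    value = (from_left + value + from_right) / 3.0   # simple smoothing update

print(f"rank {rank}: {value:.3f}")
```

Expressing those per-iteration neighbor exchanges in MapReduce forces a full job, with disk I/O and a shuffle, per time step, which is where the algorithmic-complexity argument above comes from.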

answered Sep 28 '22 by J.D.