 

Hadoop MapReduce vs MPI (vs Spark vs Mahout vs Mesos) - When to use one over the other?

I am new to parallel computing and just starting to try out MPI and Hadoop+MapReduce on Amazon AWS. But I am confused about when to use one over the other.

For example, one common rule of thumb I see can be summarized as:

  • Big data, non-iterative, fault tolerant => MapReduce (see the word-count sketch after this list)
  • Speed, small data, iterative, non-Mapper-Reducer type => MPI
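To make the first rule of thumb concrete, here is what the "Mapper-Reducer type" shape looks like as the classic word count written for Hadoop Streaming in Python. This is only an illustrative sketch; the file names and the streaming setup are assumptions, not part of the question.

```python
#!/usr/bin/env python3
# mapper.py -- reads raw text on stdin, emits one (word, 1) pair per line.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop sorts mapper output by key, so identical words arrive
# consecutively; sum the counts for each run of equal keys.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(n)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

Each map task runs independently on its own input split and each reduce task sees a disjoint key range, which is why this shape parallelizes and recovers from task failures so easily.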

But then I also see an implementation of MapReduce on top of MPI (MR-MPI) which does not provide fault tolerance, yet seems to be more efficient than Hadoop MapReduce on some benchmarks, and appears to handle big data through out-of-core processing.

Conversely, there are also MPI implementations (MPICH2-YARN) that run on the new-generation Hadoop YARN, with its distributed file system (HDFS).

Besides, there seem to be provisions within MPI (Scatter/Gather, checkpoint-restart, ULFM and other fault-tolerance efforts) that mimic several features of the MapReduce paradigm.
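For concreteness, here is a minimal mpi4py sketch of how Scatter plus a Reduce collective can mimic a single map/reduce step. Assumptions: mpi4py is installed and the script is launched with something like mpiexec -n 4 python scatter_reduce.py; the sum-of-squares workload is just a placeholder.

```python
# Mimicking one map/reduce step with MPI collectives (pickle-based mpi4py API).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Root partitions the input into one "split" per rank, like an input splitter.
chunks = None
if rank == 0:
    data = list(range(100))
    chunks = [data[i::size] for i in range(size)]

local = comm.scatter(chunks, root=0)              # distribute the splits ("map" input)
partial = sum(x * x for x in local)               # local "map" plus combiner
total = comm.reduce(partial, op=MPI.SUM, root=0)  # global "reduce" to the root

if rank == 0:
    print("sum of squares:", total)
```

What MPI does not give you out of the box here is the fault tolerance: if one rank dies mid-job, the whole job dies unless you add checkpoint-restart or ULFM-style recovery, whereas Hadoop simply reschedules the failed task.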

And how do Mahout, Mesos, and Spark fit into all this?

What criteria can be used when deciding between (or a combo of) Hadoop MapReduce, MPI, Mesos, Spark and Mahout?

asked Jan 06 '15 by GuSuku

People also ask

Is Spark always better than MapReduce?

Apache Spark is well known for its speed. It can run up to 100 times faster in memory and up to 10 times faster on disk than Hadoop MapReduce. The reason is that Apache Spark processes data in memory (RAM), while Hadoop MapReduce has to persist data back to disk after every Map or Reduce action.

In which case MapReduce is better than Spark?

Hadoop MapReduce is meant for data that does not fit in memory, whereas Apache Spark performs better for data that does fit in memory, particularly on dedicated clusters. Both are fault tolerant, but comparatively Hadoop MapReduce is more fault tolerant than Spark.

Does Spark replace MapReduce?

Apache Spark could replace Hadoop MapReduce, but Spark needs much more memory; MapReduce, by contrast, kills its processes after job completion, so it can easily run within limited memory by spilling to disk. Apache Spark performs best on iterative computations where cached data is used repeatedly.
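A hedged PySpark sketch of that iterative pattern: cache the working set once, then loop over it in memory. The HDFS path and the toy gradient-descent loop are placeholders, not part of the original answers.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()

# Parse once, then pin the RDD in memory so each iteration skips HDFS and re-parsing.
points = (spark.sparkContext
          .textFile("hdfs:///data/points.txt")   # placeholder path
          .map(float)
          .cache())

# Toy iterative job: gradient descent toward the mean of the data.
guess = 0.0
for _ in range(10):
    grad = points.map(lambda x: guess - x).mean()  # reads the cached copy
    guess -= 0.5 * grad

print("estimated mean:", guess)
spark.stop()
```

Under MapReduce, each of those ten iterations would be a separate job that rereads its input from disk, which is exactly the overhead described above.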

What other programming modules can be used to develop big data applications apart from MapReduce and the Hadoop framework?

Spark can run batch workloads as an alternative to MapReduce, and it also provides higher-level APIs for several other processing use cases.
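As an illustration of those higher-level APIs, here is a hedged sketch of a batch word count on Spark's DataFrame API instead of hand-written map and reduce functions (the input and output paths are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("batch-wordcount").getOrCreate()

counts = (spark.read.text("hdfs:///data/docs/*.txt")             # placeholder path
          .select(explode(split(col("value"), r"\s+")).alias("word"))
          .where(col("word") != "")                              # drop empty tokens
          .groupBy("word")
          .count())

counts.write.csv("hdfs:///out/wordcounts")                       # placeholder path
spark.stop()
```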


2 Answers

There might be good technical criteria for this decision, but I haven't seen anything published on them. There seems to be a cultural divide: it's understood that MapReduce gets used for sifting through data in corporate environments, while scientific workloads use MPI. That may be due to the underlying sensitivity of those workloads to network performance. Here are a few thoughts about how to find out which applies to you:

Many modern MPI implementations can run over multiple networks but are heavily optimized for InfiniBand. The canonical use case for MapReduce seems to be a cluster of "white box" commodity systems connected via Ethernet. A quick search on "MapReduce Infiniband" leads to http://dl.acm.org/citation.cfm?id=2511027, which suggests that the use of InfiniBand in a MapReduce environment is a relatively new thing.

So why would you want to run on a system that's highly optimized for InfiniBand? It's significantly more expensive than Ethernet, but it has higher bandwidth, lower latency, and scales better under high network contention (ref: http://www.hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf).

If your application would be sensitive to the effects those InfiniBand optimizations target, the tuning already baked into many MPI libraries may be useful to you. If your app is relatively insensitive to network performance and spends most of its time on computations that don't require communication between processes, MapReduce is probably the better choice.

If you have the opportunity to run benchmarks, you can make a projection on whatever system you have available to see how much improved network performance would help. Try throttling your network: downclock GigE to 100 Mbit, or InfiniBand QDR to DDR, for example, then draw a line through the results and see whether the purchase of a faster interconnect optimized by MPI would get you where you want to go.
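A back-of-the-envelope sketch of that projection in Python. All numbers below are hypothetical placeholders for the throttled measurements described above:

```python
import numpy as np

# Hypothetical runtimes of the same job measured at throttled link speeds.
bandwidth_gbps = np.array([0.1, 1.0, 10.0])    # 100 Mbit, GigE, 10GigE
runtime_sec    = np.array([5400.0, 1800.0, 1500.0])

# For communication-bound jobs, runtime tends to scale with 1/bandwidth,
# so fit runtime = a * (1/bandwidth) + b, where b is the compute-bound floor.
a, b = np.polyfit(1.0 / bandwidth_gbps, runtime_sec, 1)

for target in (40.0, 100.0):                   # e.g., QDR/EDR-class InfiniBand
    print(f"projected runtime at {target:.0f} Gbit/s: {a / target + b:.0f} s")
```

If the projected curve flattens out well above your target runtime, the job is compute-bound and a faster interconnect (and MPI) won't buy you much.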

answered Sep 28 '22 by Aaron Altman


The link you posted about FEM being done on MapReduce (http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6188175&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6188175) actually uses MPI. It states that right in the abstract: they combined MPI's programming model (non-embarrassingly parallel) with HDFS to "stage" the data and exploit data locality.

Hadoop is purely for embarrassingly parallel computations. Anything that requires processes to organize themselves and exchange data in complex ways will get crap performance with Hadoop. This can be demonstrated both from an algorithmic-complexity point of view and from a measurement point of view.
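To illustrate what "exchange data in complex ways" means, here is a minimal mpi4py sketch of a 1-D halo exchange, the communication backbone of FEM-style solvers: every rank swaps boundary values with its neighbors on every iteration, a pattern with no natural expression as independent map and reduce tasks. (Assumes mpi4py; run with something like mpiexec -n 4 python halo.py; the smoothing update is a toy placeholder.)

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size   # periodic neighbors

value = float(rank)
for _ in range(5):
    # Deadlock-free paired exchange with both neighbors each step.
    from_left = comm.sendrecv(value, dest=right, source=left)
    from_right = comm.sendrecv(value, dest=left, source=right)
    value = (from_left + value + from_right) / 3.0   # simple smoothing update

print(f"rank {rank}: {value:.3f}")
```

Expressing those per-iteration neighbor exchanges in MapReduce forces a full job, with disk I/O and a shuffle, per time step, which is where the algorithmic-complexity argument above comes from.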

answered Sep 28 '22 by J.D.