Why is Spark faster than Hadoop MapReduce?

Can someone explain, using the word count example, why Spark would be faster than MapReduce?

Victor asked Sep 14 '15 19:09


1 Answer

bafna's answer covers the memory side of the story, but I want to add two other important factors: the DAG and the ecosystem.

  1. Spark uses "lazy evaluation" to form a directed acyclic graph (DAG) of consecutive computation stages. This lets the execution plan be optimized, e.g. to minimize shuffling data around. In MapReduce, by contrast, this must be done manually by tuning each MR step. (This point is easier to understand if you are familiar with execution-plan optimization in an RDBMS or with the DAG-style execution of Apache Tez.)
  2. The Spark ecosystem has established a versatile stack of components for SQL, machine learning, streaming, and graph-mining tasks, whereas in the Hadoop ecosystem you have to install separate packages for each of these.
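To make the lazy-evaluation point concrete, here is a toy sketch in plain Python (not the real Spark API — `LazyDataset`, `flat_map`, and `collect` are hypothetical stand-ins for an RDD and its methods). Transformations only record steps in a plan; nothing runs until an action such as `collect()` is called, at which point the whole word-count pipeline executes in a single pass:

```python
from collections import Counter

class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not run."""
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []  # the recorded plan (a simple chain standing in for a DAG)

    def flat_map(self, f):
        # Lazy: just append the step to the plan, don't touch the data.
        return LazyDataset(self.data, self.ops + [("flat_map", f)])

    def map(self, f):
        return LazyDataset(self.data, self.ops + [("map", f)])

    def collect(self):
        """Action: only now is the recorded pipeline actually executed."""
        items = self.data
        for kind, f in self.ops:
            if kind == "map":
                items = [f(x) for x in items]
            else:  # flat_map
                items = [y for x in items for y in f(x)]
        return items

lines = LazyDataset(["to be or not to be"])
words = lines.flat_map(str.split)        # nothing computed yet
pairs = words.map(lambda w: (w, 1))      # still nothing computed
counts = Counter()
for w, n in pairs.collect():             # the action triggers execution
    counts[w] += n
print(counts["to"])  # 2
```

Because the plan is known in full before anything runs, a real engine can fuse the `flat_map` and `map` into one pass and place shuffles optimally — whereas each MapReduce step materializes its output to HDFS before the next one starts.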

And I want to add that even if your data is too big for main memory, you can still use Spark by choosing to persist your data on disk. Although you then give up the advantages of in-memory processing, you still benefit from the DAG execution optimization.
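In Spark itself this is done with `rdd.persist(StorageLevel.DISK_ONLY)`. The sketch below is a hedged plain-Python illustration of the payoff (the file-based "persist" here is a hypothetical stand-in, not the Spark mechanism): an expensive stage is computed once, its output is spilled to disk, and two later uses reload it instead of recomputing the pipeline:

```python
import os
import pickle
import tempfile

calls = {"n": 0}  # counts how often the expensive stage actually runs

def expensive_stage(lines):
    # Pretend this is a shuffle-heavy stage we don't want to run twice.
    calls["n"] += 1
    return sorted(w for line in lines for w in line.split())

words = expensive_stage(["to be or not to be"])

# "Persist" the stage output to disk (stand-in for DISK_ONLY persistence).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    pickle.dump(words, f)

# Two later "actions" both read the persisted copy instead of recomputing.
with open(path, "rb") as f:
    first_use = pickle.load(f)
with open(path, "rb") as f:
    second_use = pickle.load(f)
os.remove(path)

print(calls["n"])  # the expensive stage ran only once: 1
```

The data never had to fit in memory here, yet the upstream pipeline still ran only once — that is the DAG benefit surviving disk-only persistence.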

Some informative answers on Quora: here and here.

zaxliu answered Oct 04 '22 00:10