Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MapReduce or Spark? [closed]

I have tested hadoop and mapreduce with cloudera and I found it pretty cool, I thought I was the most recent and relevant BigData solution. But few days ago, I found this : https://spark.incubator.apache.org/

A "Lightning fast cluster computing system", able to work on the top of a Hadoop cluster, and apparently able to crush mapreduce. I saw that it worked more in RAM than mapreduce. I think that mapreduce is still relevant when you have to do cluster computing to overcome I/O problems you can have on a single machine. But since Spark can do the jobs that mapreduce do, and may be way more efficient on several operations, isn't it the end of MapReduce ? Or is there something more that MapReduce can do, or can MapReduce be more efficient than Spark in a certain context ?

like image 331
Nosk Avatar asked Mar 04 '14 09:03

Nosk


People also ask

Which is better MapReduce or Spark?

The primary difference between Spark and MapReduce is that Spark processes and retains data in memory for subsequent steps, whereas MapReduce processes data on disk. As a result, for smaller workloads, Spark's data processing speeds are up to 100x faster than MapReduce.

Is MapReduce replaced by Spark?

So when people say that Spark is replacing Hadoop, it actually means that big data professionals now prefer to use Apache Spark for processing the data instead of Hadoop MapReduce. MapReduce and Hadoop are not the same – MapReduce is just a component to process the data in Hadoop and so is Spark.

Is MapReduce still relevant?

In 2021, MapReduce isn't a processing model that will turn many heads. It's been around for well over a decade, and is easy to write off as an idea that might've been groundbreaking in its time, but is dated today. MapReduce was a big deal for a reason, though, and it has a lot to offer, even now.

Is there any benefit of learning MapReduce if Spark is better than MapReduce?

Tasks Spark is good for:In-memory processing makes Spark faster than Hadoop MapReduce – up to 100 times for data in RAM and up to 10 times for data in storage. Iterative processing. If the task is to process data again and again – Spark defeats Hadoop MapReduce.


2 Answers

Depends what you want to do.

MapReduce's greatest strength is processing lots of large text files. Hadoop's implementation is built around string processing, and it's very I/O heavy.

The problem with MapReduce is that people see the easy parallelism hammer and everything starts to look like a nail. Unfortunately Hadoop's performance for anything other than processing large text files is terrible. If you write a decent parallel code you can often have it finish before Hadoop even spawns its first VM. I've seen differences of 100x in my own codes.

Spark eliminates a lot of Hadoop's overheads, such as the reliance on I/O for EVERYTHING. Instead it keeps everything in-memory. Great if you have enough memory, not so great if you don't.

Remember that Spark is an extension of Hadoop, not a replacement. If you use Hadoop to process logs, Spark probably won't help. If you have more complex, maybe tightly-coupled problems then Spark would help a lot. Also, you may like Spark's Scala interface for on-line computations.

like image 142
Adam Avatar answered Sep 17 '22 01:09

Adam


MapReduce is batch oriented in nature. So, any frameworks on top of MR implementations like Hive and Pig are also batch oriented in nature. For iterative processing as in the case of Machine Learning and interactive analysis, Hadoop/MR doesn't meet the requirement. Here is a nice article from Cloudera on Why Spark which summarizes it very nicely.

It's not an end of MR. As of this writing Hadoop is much mature when compared to Spark and a lot of vendors support it. It will change over time. Cloudera has started including Spark in CDH and over time more and more vendors would be including it in their Big Data distribution and providing commercial support for it. We would see MR and Spark in parallel for foreseeable future.

Also with Hadoop 2 (aka YARN), MR and other models (including Spark) can be run on a single cluster. So, Hadoop is not going anywhere.

like image 40
Praveen Sripati Avatar answered Sep 17 '22 01:09

Praveen Sripati