I have tested hadoop and mapreduce with cloudera and I found it pretty cool, I thought I was the most recent and relevant BigData solution. But few days ago, I found this : https://spark.incubator.apache.org/
A "Lightning fast cluster computing system", able to work on the top of a Hadoop cluster, and apparently able to crush mapreduce. I saw that it worked more in RAM than mapreduce. I think that mapreduce is still relevant when you have to do cluster computing to overcome I/O problems you can have on a single machine. But since Spark can do the jobs that mapreduce do, and may be way more efficient on several operations, isn't it the end of MapReduce ? Or is there something more that MapReduce can do, or can MapReduce be more efficient than Spark in a certain context ?
The primary difference between Spark and MapReduce is that Spark processes and retains data in memory for subsequent steps, whereas MapReduce processes data on disk. As a result, for smaller workloads, Spark's data processing speeds are up to 100x faster than MapReduce.
So when people say that Spark is replacing Hadoop, it actually means that big data professionals now prefer to use Apache Spark for processing the data instead of Hadoop MapReduce. MapReduce and Hadoop are not the same – MapReduce is just a component to process the data in Hadoop and so is Spark.
In 2021, MapReduce isn't a processing model that will turn many heads. It's been around for well over a decade, and is easy to write off as an idea that might've been groundbreaking in its time, but is dated today. MapReduce was a big deal for a reason, though, and it has a lot to offer, even now.
Tasks Spark is good for:In-memory processing makes Spark faster than Hadoop MapReduce – up to 100 times for data in RAM and up to 10 times for data in storage. Iterative processing. If the task is to process data again and again – Spark defeats Hadoop MapReduce.
Depends what you want to do.
MapReduce's greatest strength is processing lots of large text files. Hadoop's implementation is built around string processing, and it's very I/O heavy.
The problem with MapReduce is that people see the easy parallelism hammer and everything starts to look like a nail. Unfortunately Hadoop's performance for anything other than processing large text files is terrible. If you write a decent parallel code you can often have it finish before Hadoop even spawns its first VM. I've seen differences of 100x in my own codes.
Spark eliminates a lot of Hadoop's overheads, such as the reliance on I/O for EVERYTHING. Instead it keeps everything in-memory. Great if you have enough memory, not so great if you don't.
Remember that Spark is an extension of Hadoop, not a replacement. If you use Hadoop to process logs, Spark probably won't help. If you have more complex, maybe tightly-coupled problems then Spark would help a lot. Also, you may like Spark's Scala interface for on-line computations.
MapReduce is batch oriented in nature. So, any frameworks on top of MR implementations like Hive and Pig are also batch oriented in nature. For iterative processing as in the case of Machine Learning and interactive analysis, Hadoop/MR doesn't meet the requirement. Here is a nice article from Cloudera on Why Spark
which summarizes it very nicely.
It's not an end of MR. As of this writing Hadoop is much mature when compared to Spark and a lot of vendors support it. It will change over time. Cloudera has started including Spark in CDH and over time more and more vendors would be including it in their Big Data distribution and providing commercial support for it. We would see MR and Spark in parallel for foreseeable future.
Also with Hadoop 2 (aka YARN), MR and other models (including Spark) can be run on a single cluster. So, Hadoop is not going anywhere.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With