MapReduce or Spark? [closed]

I have tested Hadoop and MapReduce with Cloudera and found it pretty cool. I thought it was the most recent and relevant Big Data solution. But a few days ago, I found this: https://spark.incubator.apache.org/

It is described as a "Lightning fast cluster computing system", able to run on top of a Hadoop cluster, and apparently able to crush MapReduce. I saw that it works more in RAM than MapReduce does. I think MapReduce is still relevant when you need cluster computing to overcome the I/O problems you can have on a single machine. But since Spark can do the jobs that MapReduce does, and may be far more efficient on several operations, isn't this the end of MapReduce? Or is there something more that MapReduce can do, or a context in which MapReduce can be more efficient than Spark?

331

asked Mar 04 '14 09:03

Nosk

2 Answers

Depends what you want to do.

MapReduce's greatest strength is processing lots of large text files. Hadoop's implementation is built around string processing, and it's very I/O heavy.
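To make the text-processing model concrete, here is a minimal pure-Python sketch of the map, shuffle, and reduce phases for a word count. It is only an illustration of the programming model, not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this example, and everything runs in a single process.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of text."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, the step a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    sample = [
        "spark or mapreduce",
        "mapreduce processes text",
        "spark keeps data in memory",
    ]
    print(word_count(sample))
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network and disk, which is where much of the I/O cost comes from.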

The problem with MapReduce is that people see the easy parallelism hammer and everything starts to look like a nail. Unfortunately, Hadoop's performance for anything other than processing large text files is terrible. If you write decent parallel code, it can often finish before Hadoop even spawns its first VM. I've seen differences of 100x in my own code.

Spark eliminates a lot of Hadoop's overheads, such as the reliance on I/O for EVERYTHING. Instead, it keeps everything in memory. That is great if you have enough memory, and not so great if you don't.

Remember that Spark is an extension of Hadoop, not a replacement. If you use Hadoop to process logs, Spark probably won't help. If you have more complex, perhaps tightly coupled problems, then Spark would help a lot. You may also like Spark's Scala interface for online computations.

142

answered Sep 17 '22 01:09

Adam

MapReduce is batch oriented by nature, so frameworks built on top of MR, such as Hive and Pig, are batch oriented as well. For iterative processing, as in machine learning, and for interactive analysis, Hadoop/MR doesn't meet the requirement. There is a nice article from Cloudera, "Why Spark", which summarizes this very well.
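A toy sketch of why iteration is the pain point: in a batch-style loop each iteration is typically its own job that re-reads its input from storage, while an in-memory approach loads the data once and reuses it. The code below is plain Python, not Hadoop or Spark; the names (`run_reloading`, `run_cached`, `write_sample`) and the simple mean-estimation update are assumptions chosen only to count how often the data is read.

```python
import os
import tempfile


def load(path, counter):
    """Read the whole dataset from disk and record that a read happened."""
    counter["reads"] += 1
    with open(path) as f:
        return [float(line) for line in f]


def step(data, estimate, rate=0.5):
    """One gradient-style update that moves the estimate toward the mean of the data."""
    gradient = sum(estimate - x for x in data) / len(data)
    return estimate - rate * gradient


def run_reloading(path, iterations):
    """Batch-style loop: every iteration re-reads the input, like a chain of separate jobs."""
    counter = {"reads": 0}
    estimate = 0.0
    for _ in range(iterations):
        data = load(path, counter)
        estimate = step(data, estimate)
    return estimate, counter["reads"]


def run_cached(path, iterations):
    """In-memory loop: read the input once, then iterate over the cached copy."""
    counter = {"reads": 0}
    data = load(path, counter)
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(data, estimate)
    return estimate, counter["reads"]


def write_sample(values):
    """Write one number per line to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(str(v) for v in values))
    return path


if __name__ == "__main__":
    path = write_sample([1.0, 2.0, 3.0, 4.0])
    try:
        print(run_reloading(path, 10))  # (estimate, number of disk reads)
        print(run_cached(path, 10))
    finally:
        os.remove(path)
```

Both loops compute the same estimate; only the number of reads differs, and that gap grows with the number of iterations and the size of the data.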

It's not the end of MR. As of this writing, Hadoop is much more mature than Spark, and a lot of vendors support it. That will change over time. Cloudera has started including Spark in CDH, and over time more and more vendors will include it in their Big Data distributions and provide commercial support for it. We will see MR and Spark side by side for the foreseeable future.

Also with Hadoop 2 (aka YARN), MR and other models (including Spark) can be run on a single cluster. So, Hadoop is not going anywhere.

answered Sep 17 '22 01:09

Praveen Sripati

Tags:

apache-spark

hadoop

mapreduce
