MapReduce or Spark for Batch processing on Hadoop?

I know that MapReduce is a great framework for batch processing on Hadoop. But Spark can also be used as a batch framework on Hadoop, and it provides scalability, fault tolerance and high performance compared to MapReduce. Cloudera, Hortonworks and MapR have started supporting Spark on Hadoop with YARN as well.

But a lot of companies are still using the MapReduce framework on Hadoop for batch processing instead of Spark.

So I am trying to understand: what are the current challenges of using Spark as a batch processing framework on Hadoop?

Any thoughts?

284

asked Oct 30 '14 17:10

Venkat Ankam

2 Answers

I'm assuming when you say Hadoop you mean HDFS.

There are a number of benefits of using Spark over Hadoop MR.

1. Performance: Spark is at least as fast as Hadoop MR. For iterative algorithms (which make a number of passes over the same dataset) it can be a few orders of magnitude faster. MapReduce writes the output of each stage to HDFS.

1.1. Spark can cache these intermediate results (depending on the available memory) and therefore reduce the latency due to disk I/O.

1.2. Spark operations are lazy. This means Spark can perform certain optimizations before it starts processing the data, because it can reorder operations that have not executed yet.

1.3. Spark keeps a lineage of operations and, in case of failure, recreates the partially failed state from this lineage.

2. Unified Ecosystem: Spark provides a unified programming model for various types of analysis: batch (spark-core), interactive (REPL), streaming (spark-streaming), machine learning (mllib), graph processing (graphx) and SQL queries (SparkSQL).

3. Richer and Simpler API: Spark's API is richer and simpler. Richer because it supports many more operations (e.g., groupBy, filter ...). Simpler because of the expressiveness of these functional constructs. Spark's API supports Java, Scala and Python (for most APIs). There is experimental support for R.

4. Multiple Datastore Support: Spark supports many data stores out of the box. You can use Spark to analyze data in a normal or distributed file system, HDFS, Amazon S3, Apache Cassandra, Apache Hive and ElasticSearch, to name a few. I'm sure support for many other popular data stores is coming soon. This essentially means that if you want to adopt Spark you don't have to move your data around.
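To make points 1.2 and 1.3 concrete, here is a minimal sketch, not a definitive implementation. The object name, the helper `sumOfEvenSquares`, the app name and the `local[2]` master are all assumptions made for illustration. The transformations only record what to do, `toDebugString` prints the recorded lineage, and nothing is computed until the `reduce` action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyLineageSketch {
  // Hypothetical helper: sums the squares of the even numbers in 1..upTo (upTo >= 2).
  def sumOfEvenSquares(sc: SparkContext, upTo: Int): Long = {
    val numbers = sc.parallelize(1 to upTo)

    // Transformations are lazy: these two lines only describe the work.
    val evens   = numbers.filter(_ % 2 == 0)
    val squares = evens.map(n => n.toLong * n)

    // The lineage Spark would replay to rebuild a lost partition.
    println(squares.toDebugString)

    // The action is what actually triggers a job.
    squares.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("lazy-lineage-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(s"sum of even squares = ${sumOfEvenSquares(sc, 1000)}")
    } finally {
      sc.stop()
    }
  }
}
```

On a real cluster you would drop `setMaster` and submit the job through YARN instead.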

For example, here is what the code for word count looks like in Spark (Scala).

val textFile = sc.textFile("some file on HDFS")
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

I'm sure you would have to write quite a few more lines with standard Hadoop MR.

Here are some common misconceptions about Spark.

"Spark is just an in-memory cluster computing framework." This is not true. Spark excels when your data fits in memory because memory access latency is lower, but you can make it work even when your dataset doesn't completely fit in memory.
"You need to learn Scala to use Spark." Spark is written in Scala and runs on the JVM, but it provides most of the common APIs in Java and Python as well, so you can easily get started with Spark without knowing Scala.
"Spark does not scale. Spark is for small datasets (GBs) only and doesn't scale to a large number of machines or TBs of data." This is also not true. It has been used successfully to sort petabytes of data.
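As a rough illustration of the first misconception, the sketch below persists an RDD with `StorageLevel.MEMORY_AND_DISK`, which stores partitions that don't fit in memory on disk instead of recomputing them. The object name, the helper `persistedCount`, the toy data and the `local[2]` master are assumptions for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SpillToDiskSketch {
  // Hypothetical helper: persists n toy records and counts them twice.
  def persistedCount(sc: SparkContext, n: Int): Long = {
    val records = sc.parallelize(1 to n).map(i => (i % 100, i.toLong))

    // Partitions that do not fit in memory are written to disk and
    // read back from there, rather than being recomputed.
    records.persist(StorageLevel.MEMORY_AND_DISK)

    val first  = records.count() // computes and stores the partitions
    val second = records.count() // reads the stored partitions
    require(first == second, "both passes should see the same records")

    records.unpersist()
    first
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("spill-to-disk-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(s"records counted: ${persistedCount(sc, 100000)}")
    } finally {
      sc.stop()
    }
  }
}
```

Plain `cache()` uses `MEMORY_ONLY`, where partitions that don't fit are recomputed when needed, so either way the job still completes.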

Finally, if you do not have a legacy codebase in Hadoop MR, it makes perfect sense to adopt Spark. The simple reason is that all major Hadoop vendors are moving towards Spark, and for good reason.

answered Nov 14 '22 15:11

Soumya Simanta

Spark is an order of magnitude faster than MapReduce for iterative algorithms, since it gets a significant speedup from keeping intermediate data cached in the local JVM.
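A minimal sketch of what "iterative" means here, under stated assumptions: the toy gradient descent below estimates the mean of a small dataset and runs one job over the same cached RDD on every iteration. The object name, the helper `fitMean`, the step size and the `local[2]` master are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeCacheSketch {
  // Hypothetical helper: gradient descent on sum((w - x)^2), which converges to the mean of xs.
  def fitMean(sc: SparkContext, xs: Seq[Double], iterations: Int): Double = {
    // cache() keeps the dataset in memory so each pass avoids rebuilding it.
    val points = sc.parallelize(xs).cache()
    val n = points.count()

    var w = 0.0
    for (_ <- 1 to iterations) {
      val current = w // immutable copy captured by the closure
      val gradient = points.map(x => 2.0 * (current - x)).reduce(_ + _) / n
      w -= 0.1 * gradient // assumed step size
    }

    points.unpersist()
    w
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("iterative-cache-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(s"estimated mean = ${fitMean(sc, Seq(1.0, 2.0, 3.0, 4.0), 50)}")
    } finally {
      sc.stop()
    }
  }
}
```

With MapReduce, each of those iterations would be a separate job that reads its input from HDFS again.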

Spark 1.1 primarily brought a new shuffle implementation (sort-based shuffle instead of hash-based shuffle), a new network module (based on Netty instead of using the block manager to send shuffle data) and a new external shuffle service. With these, Spark performed the fastest petabyte sort (on 190 nodes with 46 TB of RAM) and a terabyte sort that broke Hadoop's old record.

Spark can easily handle datasets that are an order of magnitude larger than the cluster's aggregate memory. So my thought is that Spark is heading in the right direction and will eventually get even better.

For reference, this blog post explains how Databricks performed the petabyte sort.

157

answered Nov 14 '22 14:11

Ashrith

Tags:

apache-spark

hadoop

batch-processing

mapreduce
