
What is the difference between Spark Standalone, YARN and local mode?

Tags:

apache-spark

People also ask

What is the difference between Spark standalone and YARN?

Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN, you choose the number of executors to use.
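For illustration, here is a minimal sketch (assuming the Spark 2.x+ SparkSession API; the app name and counts are arbitrary) of how YARN lets you pick the executor count explicitly:

```scala
import org.apache.spark.sql.SparkSession

// On YARN the executor count is an explicit setting
// (spark.executor.instances, or --num-executors on spark-submit).
val spark = SparkSession.builder()
  .appName("yarn-executors-demo")           // arbitrary name
  .master("yarn")
  .config("spark.executor.instances", "4")  // 4 executors, not one per node
  .config("spark.executor.cores", "2")      // 2 cores per executor
  .getOrCreate()
```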

What is standalone mode in Spark?

Spark's standalone mode offers a web-based user interface to monitor the cluster. The master and each worker have their own web UIs that show cluster and job statistics. By default, you can access the web UI for the master at port 8080. The port can be changed either in the configuration file or via command-line options.

What is the difference between Spark and YARN?

YARN is a distributed container manager, like Mesos for example, whereas Spark is a data processing tool. Spark can run on YARN, the same way Hadoop MapReduce can run on YARN. It just happens that Hadoop MapReduce ships with YARN, while Spark does not.

What is YARN mode in Spark?

Spark supports two modes for running on YARN, “yarn-cluster” mode and “yarn-client” mode. Broadly, yarn-cluster mode makes sense for production jobs, while yarn-client mode makes sense for interactive and debugging uses where you want to see your application's output immediately.
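As a hedged sketch: in later Spark versions the "yarn-client"/"yarn-cluster" master URLs were folded into master yarn plus a separate deploy mode, normally passed to spark-submit via --deploy-mode, though the equivalent configuration key can illustrate the choice:

```scala
import org.apache.spark.sql.SparkSession

// "client" keeps the driver in your submitting JVM (interactive/debugging);
// "cluster" launches the driver inside the YARN cluster (production jobs).
// Deploy mode is normally chosen at submit time; setting it here is illustrative.
val spark = SparkSession.builder()
  .appName("yarn-deploy-mode-demo")             // arbitrary name
  .master("yarn")
  .config("spark.submit.deployMode", "client")  // or "cluster"
  .getOrCreate()
```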


You are confusing Hadoop YARN with Spark.

YARN is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications.

With the introduction of YARN, Hadoop opened up to running other applications on the platform.

In short, YARN is a "pluggable data-parallel framework".

Apache Spark

Apache Spark is a batch, interactive, and streaming framework. Spark has a "pluggable persistent store" and can run with any persistence layer.

For Spark to run, it needs resources. In standalone mode you start the workers and the Spark master yourself, and the persistence layer can be anything: HDFS, a local file system, Cassandra, etc. In YARN mode you ask the YARN/Hadoop cluster to manage the resource allocation and bookkeeping.

When you set the master to local[2], you ask Spark to use 2 cores and to run the driver and workers in the same JVM. In local mode, all tasks of a Spark job run in that single JVM.

So the only difference between standalone and local mode is that in standalone mode you define "containers" for the workers and the Spark master to run on your machine (so you can have, say, 2 workers, and your tasks are distributed across the JVMs of those two workers), whereas in local mode you run everything in a single JVM on your local machine.
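A minimal sketch of that difference, assuming a standalone master is already running at spark://master-host:7077 (the hostname is a placeholder); the only thing that changes in your program is the master URL:

```scala
import org.apache.spark.sql.SparkSession

// Local mode: driver and executors share this one JVM, using 2 cores.
val spark = SparkSession.builder()
  .appName("local-vs-standalone")          // arbitrary name
  .master("local[2]")
  .getOrCreate()

// Standalone mode: connect to a separately started Spark master instead;
// tasks are then distributed across the worker JVMs it manages.
//   .master("spark://master-host:7077")   // placeholder host
```

In both cases the persistence layer stays whatever you point it at (HDFS, local files, Cassandra, ...); the master URL only decides where the computation runs.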



Local mode
Think of local mode as executing a program on your laptop using a single JVM. It can be a Java, Scala, or Python program in which you have defined and used a Spark context object, imported the Spark libraries, and processed data residing on your system.
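For example, a minimal self-contained local-mode program in Scala (the input path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

object LocalModeDemo {
  def main(args: Array[String]): Unit = {
    // Driver and executors all run inside this single JVM.
    val spark = SparkSession.builder()
      .appName("LocalModeDemo")
      .master("local[*]")   // use all cores on this machine
      .getOrCreate()

    // Process data residing on your own system.
    val lines = spark.read.textFile("file:///tmp/input.txt")
    println(s"line count: ${lines.count()}")

    spark.stop()
  }
}
```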


YARN
In reality, Spark programs are meant to process data stored across machines. Executors process the data stored on those machines. We need a utility to monitor the executors and manage resources on these machines (the cluster). Hadoop has its own resource manager for this purpose. So when you run a Spark program against HDFS, you can leverage Hadoop's resource manager utility, i.e. YARN. The Hadoop properties are obtained from HADOOP_CONF_DIR, set inside spark-env.sh or your bash_profile.
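A hedged sketch of the YARN case: the program itself only sets the master to yarn; finding the cluster relies on HADOOP_CONF_DIR being exported in the environment (e.g. in spark-env.sh), as noted above:

```scala
import org.apache.spark.sql.SparkSession

// Spark locates the YARN ResourceManager through the Hadoop config files,
// so HADOOP_CONF_DIR must be set; this runtime check is purely illustrative.
require(sys.env.contains("HADOOP_CONF_DIR"),
  "HADOOP_CONF_DIR must point at your Hadoop configuration directory")

val spark = SparkSession.builder()
  .appName("yarn-demo")   // arbitrary name
  .master("yarn")
  .getOrCreate()

// Data typically lives on HDFS in this setup; the path is a placeholder.
println(spark.read.textFile("hdfs:///data/input.txt").count())
```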


Spark Standalone
The Spark distribution also comes with its own resource manager. When your program uses Spark's resource manager, the execution mode is called standalone. Moreover, Spark allows you to create a distributed master-slave architecture by configuring the properties files under the $SPARK_HOME/conf directory. By default it is set up as a single-node cluster, just like Hadoop's pseudo-distributed mode.