Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark on yarn concept understanding

I am trying to understand how spark runs on YARN cluster/client. I have the following question in my mind.

  1. Is it necessary that spark is installed on all the nodes in yarn cluster? I think it should because worker nodes in cluster execute a task and should be able to decode the code(spark APIs) in spark application sent to cluster by the driver?

  2. It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does client node have to install Hadoop when it is sending the job to cluster?

like image 434
Sporty Avatar asked Jul 23 '14 12:07

Sporty


People also ask

How does Spark work with YARN?

In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

What is Spark and how it works?

Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of circumstances. On top of the Spark core data processing engine, there are libraries for SQL, machine learning, graph computation, and stream processing, which can be used together in an application.

What is Hadoop Spark YARN?

YARN is a generic resource-management framework for distributed workloads; in other words, a cluster-level operating system. Although part of the Hadoop ecosystem, YARN can support a lot of varied compute-frameworks (such as Tez, and Spark) in addition to MapReduce.

Does Spark work without YARN?

As per Spark documentation, Spark can run without Hadoop. You may run it as a Standalone mode without any resource manager. But if you want to run in multi-node setup, you need a resource manager like YARN or Mesos and a distributed file system like HDFS,S3 etc.


2 Answers

Adding to other answers.

  1. Is it necessary that spark is installed on all the nodes in the yarn cluster?

No, If the spark job is scheduling in YARN(either client or cluster mode). Spark installation is needed in many nodes only for standalone mode.

These are the visualizations of spark app deployment modes.

Spark Standalone Cluster

Spark standalone mode

In cluster mode driver will be sitting in one of the Spark Worker node whereas in client mode it will be within the machine which launched the job.


YARN cluster mode

YARN cluster mode

YARN client mode

YARN client mode

This table offers a concise list of differences between these modes:

differences among Standalone, YARN Cluster and YARN Client modes

pics source

  1. It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is sending the job to cluster?

Hadoop installation is not mandatory but configurations(not all) are!. We can call them Gateway nodes. It's for two main reasons.

  • The configuration contained in HADOOP_CONF_DIR directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
  • In YARN mode the ResourceManager’s address is picked up from the Hadoop configuration(yarn-default.xml). Thus, the --master parameter is yarn.

Update: (2017-01-04)

Spark 2.0+ no longer requires a fat assembly jar for production deployment. source

like image 79
mrsrinivas Avatar answered Oct 06 '22 02:10

mrsrinivas


We are running spark jobs on YARN (we use HDP 2.2).

We don't have spark installed on the cluster. We only added the Spark assembly jar to the HDFS.

For example to run the Pi example:

./bin/spark-submit \   --verbose \   --class org.apache.spark.examples.SparkPi \   --master yarn-cluster \   --conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar \   --num-executors 2 \   --driver-memory 512m \   --executor-memory 512m \   --executor-cores 4 \   hdfs://master:8020/spark/spark-examples-1.3.1-hadoop2.6.0.jar 100 

--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar - This config tell the yarn from were to take the spark assembly. If you don't use it, it will upload the jar from were you run spark-submit.

About your second question: The client node doesn't not need Hadoop installed. It only needs the configuration files. You can copy the directory from your cluster to your client.

like image 41
RanP Avatar answered Oct 06 '22 01:10

RanP