Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is yarn-client mode in Spark?

Apache Spark has recently updated the version to 0.8.1, in which yarn-client mode is available. My question is, what does yarn-client mode really mean? In the documentation it says:

With yarn-client mode, the application will be launched locally. Just like running application or spark-shell on Local / Mesos / Standalone mode. The launch method is also the similar with them, just make sure that when you need to specify a master url, use “yarn-client” instead

What does it mean "launched locally"? Locally where? On the Spark cluster?
What is the specific difference from the yarn-standalone mode?

like image 949
zxz Avatar asked Dec 27 '13 01:12

zxz


People also ask

What is YARN client mode?

In yarn-cluster mode the driver is running remotely on a data node and the workers are running on separate data nodes. In yarn-client mode the driver is on the machine that started the job and the workers are on the data nodes. In local mode the driver and workers are on the machine that started the job.

What is client mode in Spark?

client. In client mode, the driver runs locally from where you are submitting your application using spark-submit command. client mode is majorly used for interactive and debugging purposes. Note that in client mode only the driver runs locally and all tasks run on cluster worker nodes.

What is YARN used for in Spark?

There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.

What are the various modes in which Spark runs on YARN?

There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.


2 Answers

A Spark application consists of a driver and one or many executors. The driver program is the main program (where you instantiate SparkContext), which coordinates the executors to run the Spark application. The executors run tasks assigned by the driver.

A YARN application has the following roles: yarn client, yarn application master and list of containers running on the node managers.

When Spark application runs on YARN, it has its own implementation of yarn client and yarn application master.

With those background, the major difference is where the driver program runs.

  1. Yarn Standalone Mode: your driver program is running as a thread of the yarn application master, which itself runs on one of the node managers in the cluster. The Yarn client just pulls status from the application master. This mode is same as a mapreduce job, where the MR application master coordinates the containers to run the map/reduce tasks.
  2. Yarn client mode: your driver program is running on the yarn client where you type the command to submit the spark application (may not be a machine in the yarn cluster). In this mode, although the drive program is running on the client machine, the tasks are executed on the executors in the node managers of the YARN cluster.

Reference: http://spark.incubator.apache.org/docs/latest/cluster-overview.html

like image 26
Mingjiang Shi Avatar answered Oct 06 '22 19:10

Mingjiang Shi


So in spark you have two different components. There is the driver and the workers. In yarn-cluster mode the driver is running remotely on a data node and the workers are running on separate data nodes. In yarn-client mode the driver is on the machine that started the job and the workers are on the data nodes. In local mode the driver and workers are on the machine that started the job.

When you run .collect() the data from the worker nodes get pulled into the driver. It's basically where the final bit of processing happens.

For my self i have found yarn-cluster mode to be better when i'm at home on the vpn, but yarn-client mode is better when i'm running code from within the data center.

Yarn-client mode also means you tie up one less worker node for the driver.

like image 188
ben jarman Avatar answered Oct 06 '22 18:10

ben jarman