What is yarn-client mode in Spark?

Tags:

hadoop-yarn

Apache Spark was recently updated to version 0.8.1, which makes yarn-client mode available. My question is: what does yarn-client mode really mean? The documentation says:

With yarn-client mode, the application will be launched locally. Just like running application or spark-shell on Local / Mesos / Standalone mode. The launch method is also the similar with them, just make sure that when you need to specify a master url, use “yarn-client” instead

What does "launched locally" mean? Locally where? On the Spark cluster?
What is the specific difference from yarn-standalone mode?


asked Dec 27 '13 01:12

2 Answers

A Spark application consists of a driver and one or more executors. The driver program is the main program (where you instantiate SparkContext), which coordinates the executors to run the Spark application. The executors run tasks assigned by the driver.

A YARN application has the following roles: a YARN client, a YARN application master, and a list of containers running on the node managers.

When a Spark application runs on YARN, it has its own implementation of the YARN client and the YARN application master.

With that background, the major difference is where the driver program runs.

Yarn Standalone Mode: your driver program runs as a thread of the YARN application master, which itself runs on one of the node managers in the cluster. The YARN client just pulls status from the application master. This mode is the same as a MapReduce job, where the MR application master coordinates the containers to run the map/reduce tasks.
Yarn Client Mode: your driver program runs on the YARN client, that is, the machine where you type the command to submit the Spark application (which may not be a machine in the YARN cluster). In this mode, although the driver program runs on the client machine, the tasks are executed on the executors in the node managers of the YARN cluster.
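
To make the two modes concrete, here is a minimal Python sketch that only builds the command line for each one. It assumes a Spark release recent enough to ship spark-submit with the --master and --deploy-mode options, which is newer than the 0.8.1 release the question is about. The helper name spark_submit_command, the jar name and the class name are hypothetical.

```python
def spark_submit_command(app_jar, main_class, deploy_mode):
    """Build a spark-submit argument list for running on YARN.

    deploy_mode "client":  the driver runs in the process that submits the job.
    deploy_mode "cluster": the driver runs inside the YARN application master.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        app_jar,
    ]


if __name__ == "__main__":
    # Hypothetical jar and class names, used only for illustration.
    for mode in ("client", "cluster"):
        print(" ".join(spark_submit_command("my-app.jar", "com.example.MyApp", mode)))
```

The only thing that changes between the two commands is the deploy mode. In both cases the executors run in containers on the node managers; what moves is the driver.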

Reference: http://spark.incubator.apache.org/docs/latest/cluster-overview.html

answered Oct 06 '22 19:10

Mingjiang Shi

So in Spark you have two different components: the driver and the workers. In yarn-cluster mode the driver runs remotely on a data node and the workers run on separate data nodes. In yarn-client mode the driver is on the machine that started the job and the workers are on the data nodes. In local mode the driver and workers are both on the machine that started the job.

When you run .collect(), the data from the worker nodes gets pulled into the driver. It's basically where the final bit of processing happens.
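
As a rough analogy in plain Python (this is not Spark code, and the names run_task and collect are made up for illustration), the sketch below lets a pool of threads stand in for the executors and the calling code stand in for the driver:

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    # Stand-in for a task that an executor runs on its own slice of the data.
    return [x * x for x in partition]


def collect(partitions):
    # Stand-in for .collect(): each partition's result is brought back and
    # concatenated in the calling ("driver") code.
    with ThreadPoolExecutor(max_workers=len(partitions)) as executors:
        results = executors.map(run_task, partitions)
        return [value for part in results for value in part]


if __name__ == "__main__":
    print(collect([[1, 2, 3], [4, 5], [6]]))  # prints [1, 4, 9, 16, 25, 36]
```

The point of the analogy is that the full result ends up in the driver's memory, so in yarn-client mode it lands on the machine that started the job.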

For myself, I have found yarn-cluster mode to be better when I'm at home on the VPN, but yarn-client mode is better when I'm running code from within the data center.

Yarn-client mode also means you tie up one less worker node for the driver.


answered Oct 06 '22 18:10

ben jarman
