Spark-submit / spark-shell > difference between yarn-client and yarn-cluster mode

Tags:

hadoop-yarn

I am running Spark with YARN.

From the link: http://spark.apache.org/docs/latest/running-on-yarn.html

I found an explanation of the different YARN modes, i.e. the values of the --master option with which Spark can run:

"There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN"

From this, I understand only that the difference is where the driver runs, but I cannot tell which mode runs faster. Moreover:

When running spark-submit, the --master option can be either yarn-client or yarn-cluster
Correspondingly, spark-shell's --master option can be yarn-client, but it does not support cluster mode

So I do not know how to choose: when to use spark-shell, when to use spark-submit, and especially when to use client mode versus cluster mode.

asked Oct 20 '15 10:10

Rui

3 Answers

spark-shell should be used for interactive queries. It needs to run in yarn-client mode so that the machine you're running on acts as the driver.

For spark-submit, you submit jobs to the cluster and the tasks then run on the cluster. Normally you would run in cluster mode so that YARN can assign the driver to a suitable node on the cluster with available resources.
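As a rough sketch of the two launch styles described above, the following Python helpers build the command lines rather than running them. The helper names and the application path `my_job.py` are hypothetical. The sketch uses the `--master yarn --deploy-mode <mode>` form; the `yarn-client` / `yarn-cluster` master values mentioned in the question belong to older Spark releases, so check which form your version expects.

```python
import shlex


def yarn_submit_command(deploy_mode, app_path, app_args=()):
    """Build a spark-submit argument list for YARN (hypothetical helper)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        app_path,
        *app_args,
    ]


def yarn_shell_command():
    """Build a spark-shell argument list (hypothetical helper).

    The shell is interactive, so the driver has to stay in the local
    process; only client mode applies here.
    """
    return ["spark-shell", "--master", "yarn", "--deploy-mode", "client"]


# 'my_job.py' is a placeholder application, not a real file.
print(shlex.join(yarn_submit_command("cluster", "my_job.py")))
print(shlex.join(yarn_shell_command()))
```

Building the argument list first keeps the choice of deploy mode in one place, which is handy if the same job is run in client mode while debugging and in cluster mode afterwards.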

Some commands (like .collect()) send all the data to the driver node, so performance can differ significantly depending on whether your driver node is inside the cluster or on a machine outside the cluster (e.g. a user's laptop).

answered Sep 28 '22 03:09

Ewan Leith

Client mode - Use for interactive queries, where you want to see the output directly (on a local machine or edge node). The driver runs on the local machine / edge node from which you launched the application.

Cluster mode - This mode launches the driver inside the cluster, irrespective of the machine you used to submit the application. YARN starts an application master in which the driver is created, which makes the application more fault tolerant: it keeps running even if the submitting machine goes away.

answered Sep 28 '22 02:09

Abhishek Sakhuja

For learning purposes, client mode is good enough. In a production environment you should ALWAYS use cluster mode.

I'll explain with the help of an example. Imagine a scenario where you want to launch multiple applications. Let's say you have a 5-node cluster with nodes A, B, C, D, E.

The workload is distributed across all 5 worker nodes, and one node is additionally used to submit jobs (say 'A' is used for this). Now every time you launch an application in client mode, the driver process always runs on 'A'.

It might work well for a few jobs, but as the number of jobs keeps increasing, 'A' will run short of resources like CPU and memory.

Imagine the impact on a very large cluster which runs multiple such jobs.

But if you choose cluster mode, the driver will not run on 'A' every time; instead, the drivers will be distributed across all 5 nodes. The resources in this case are more evenly utilized.
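The five-node scenario can be sketched as a toy model that counts where driver processes end up. This is only an illustration under a strong assumption: cluster mode is modelled as simple round-robin placement, whereas real YARN scheduling depends on available resources and the configured scheduler. The function name `place_drivers` is made up for this example.

```python
NODES = ["A", "B", "C", "D", "E"]


def place_drivers(num_apps, mode, nodes=NODES, submit_node="A"):
    """Count driver processes per node for num_apps applications.

    Toy model only: 'cluster' placement is round-robin, which is an
    assumption and not how YARN actually schedules containers.
    """
    counts = {}
    if mode == "client":
        # Every driver runs on the machine that submitted the job.
        counts[submit_node] = num_apps
    elif mode == "cluster":
        # Drivers are spread over the worker nodes.
        for i in range(num_apps):
            node = nodes[i % len(nodes)]
            counts[node] = counts.get(node, 0) + 1
    else:
        raise ValueError("mode must be 'client' or 'cluster'")
    return counts


print(place_drivers(10, "client"))   # {'A': 10}
print(place_drivers(10, "cluster"))  # {'A': 2, 'B': 2, 'C': 2, 'D': 2, 'E': 2}
```

With ten applications, client mode stacks all ten drivers on 'A', while the round-robin stand-in for cluster mode leaves two on each node.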

Hope this helps you to decide what mode to choose.

answered Sep 28 '22 01:09

Saurabh
