Spark-submit / spark-shell > difference between yarn-client and yarn-cluster mode

I am running Spark with YARN.

From the link: http://spark.apache.org/docs/latest/running-on-yarn.html

There I found an explanation of the different YARN modes, i.e. the values of the --master option with which Spark can run:

"There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN"

From this, I understand only that the difference is where the driver runs; I cannot tell which mode runs faster. Moreover:

  • With spark-submit, the --master can be either yarn-client or yarn-cluster
  • Correspondingly, spark-shell's --master option can be yarn-client, but it does not support cluster mode

So I do not know how to choose: when to use spark-shell, when to use spark-submit, and especially when to use client mode versus cluster mode.

Rui asked Oct 20 '15 10:10


People also ask

What's the diff between cluster and client execution using Spark-submit?

A Spark application can be submitted in two different ways: cluster mode and client mode. In cluster mode, the driver is started inside the cluster on one of the worker machines, so the client can fire the job and forget it. In client mode, the driver is started within the client.

What is the spark-submit syntax, and what is the difference between client and cluster mode?

Cluster mode: the driver runs inside the cluster; the spark-submit syntax is --deploy-mode cluster. Client mode: you submit a Spark batch application and the driver runs on the machine you are submitting from; the spark-submit syntax is --deploy-mode client.

Which is better client or cluster mode in Spark?

Cluster mode is used to run production jobs. In client mode, the driver runs locally on the machine from which you submit your application with the spark-submit command. Client mode is mainly used for interactive work and debugging.

What is client mode in Spark-submit?

In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster. The input and output of the application are attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).

What is Spark shell and Spark-submit?

The Spark shell is intended only for testing and perhaps for developing small applications. It is an interactive shell and should not be used to run production Spark applications. For production application deployment you should use spark-submit.

Can we run Spark Shell in cluster mode?

Depending on the resource manager, Spark can run in two modes: local mode and cluster mode. The resource manager is specified with a command-line option called --master.


3 Answers

spark-shell should be used for interactive queries. It needs to run in yarn-client mode so that the machine you're running on acts as the driver.

With spark-submit, you submit jobs to the cluster and the tasks then run in the cluster. Normally you would run in cluster mode so that YARN can assign the driver to a suitable node on the cluster with available resources.

Some commands (like .collect()) send all the data to the driver node, so performance can differ significantly depending on whether your driver node is inside the cluster or on a machine outside it (e.g. a user's laptop).
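To put a rough number on the .collect() point, here is a back-of-the-envelope sketch in Python. The result size and both link speeds are made-up illustrative values, not measurements, and the model ignores serialization, latency and driver memory limits; it only shows how the driver's network location scales the transfer time.

```python
def transfer_seconds(result_bytes: int, link_bits_per_second: int) -> float:
    """Idealized time to move a collected result to the driver over one link."""
    return result_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    result_bytes = 2 * 10**9        # hypothetical 2 GB result returned by .collect()
    cluster_link = 10 * 10**9       # hypothetical 10 Gbit/s link between cluster nodes
    laptop_link = 100 * 10**6       # hypothetical 100 Mbit/s link to a laptop outside the cluster

    print(f"driver inside cluster: {transfer_seconds(result_bytes, cluster_link):.1f} s")
    print(f"driver on laptop:      {transfer_seconds(result_bytes, laptop_link):.1f} s")
```

Under these assumed figures the same result takes about a hundred times longer to reach a driver on the laptop, which is the kind of gap the answer is warning about.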

Ewan Leith answered Sep 28 '22 03:09


Client mode - Use this for interactive queries where you want to see the output directly. The driver runs on the machine from which you launched the application (your local machine or an edge node).

Cluster mode - This mode launches the driver inside the cluster, regardless of which machine you used to submit the application. YARN starts an application master in which the driver runs, so the driver is managed by YARN and becomes more fault tolerant.
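As a small illustration of how the two modes are selected on the command line, the Python sketch below only builds the argument list; it does not launch anything. The helper name `build_submit_command` and the file name `my_job.py` are hypothetical. It assumes the `--master yarn --deploy-mode <mode>` form used by newer Spark releases; on the older releases discussed in the question the equivalent was `--master yarn-client` or `--master yarn-cluster`.

```python
from typing import List


def build_submit_command(app: str, deploy_mode: str, tool: str = "spark-submit") -> List[str]:
    """Build an argv list for launching a Spark application on YARN.

    `app` is a hypothetical application file; `deploy_mode` is "client" or "cluster".
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode!r}")
    if tool == "spark-shell":
        # The interactive shell needs its driver on the local machine,
        # so only client mode applies.
        if deploy_mode == "cluster":
            raise ValueError("spark-shell only supports client mode")
        return ["spark-shell", "--master", "yarn", "--deploy-mode", "client"]
    return ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode, app]


if __name__ == "__main__":
    # Driver runs inside the cluster, in the YARN application master.
    print(" ".join(build_submit_command("my_job.py", "cluster")))
    # Driver runs on the machine you submit from.
    print(" ".join(build_submit_command("my_job.py", "client")))
```

The only difference between the two spark-submit invocations is the value passed to `--deploy-mode`; everything else about the job stays the same.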

Abhishek Sakhuja answered Sep 28 '22 02:09


For learning purposes, client mode is good enough. In a production environment you should ALWAYS use cluster mode.

I'll explain with the help of an example. Imagine a scenario where you want to launch multiple applications. Let's say you have a 5-node cluster with nodes A, B, C, D and E.

The workload is distributed across all 5 worker nodes, and one node is additionally used to submit jobs (say 'A'). Now, every time you launch an application in client mode, the driver process always runs on 'A'.

That might work well for a few jobs, but as the number of jobs keeps increasing, 'A' will run short of resources such as CPU and memory.

Imagine the impact on a very large cluster that runs many such jobs.

But if you choose cluster mode, the driver will no longer run on 'A' every time; instead the drivers are distributed across all 5 nodes. The resources in this case are more evenly utilized.
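The 5-node example can be sketched as a toy simulation in Python. This is only a model, not Spark or YARN code: the function `place_drivers` is hypothetical, and cluster mode is approximated by round-robin placement, whereas real YARN picks a node based on available resources.

```python
from itertools import cycle
from typing import Dict, List


def place_drivers(nodes: List[str], submit_node: str, num_apps: int, mode: str) -> Dict[str, int]:
    """Count how many driver processes end up on each node."""
    counts = {node: 0 for node in nodes}
    if mode == "client":
        # Every driver runs on the machine the jobs are submitted from.
        counts[submit_node] += num_apps
    elif mode == "cluster":
        # Assumption: round-robin stands in for YARN's resource-based choice.
        picker = cycle(nodes)
        for _ in range(num_apps):
            counts[next(picker)] += 1
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return counts


if __name__ == "__main__":
    nodes = ["A", "B", "C", "D", "E"]
    print("client: ", place_drivers(nodes, "A", 10, "client"))
    print("cluster:", place_drivers(nodes, "A", 10, "cluster"))
```

With 10 applications, the client-mode run puts all 10 drivers on 'A', while the cluster-mode run puts 2 on each node.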

Hope this helps you to decide what mode to choose.

Saurabh answered Sep 28 '22 01:09