I am running Spark with YARN.
From the link: http://spark.apache.org/docs/latest/running-on-yarn.html
I found an explanation of the different YARN modes in which Spark can run, i.e. the --master option:
"There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN"
From this I can only understand that the difference is where the driver runs, but I cannot tell which one runs faster. Moreover, I do not know how to make the choice, i.e. when to use spark-shell vs. spark-submit, and especially when to use client mode vs. cluster mode.
A Spark application can be submitted in two different ways: cluster mode and client mode.
Cluster mode: the driver is started inside the cluster on one of the worker machines, so the client can fire the job and forget it. The spark-submit syntax is --deploy-mode cluster.
Client mode: the driver is started on the machine you are submitting from. The spark-submit syntax is --deploy-mode client.
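As a minimal sketch, both invocations look like this (com.example.MyApp and my-app.jar are hypothetical placeholders for your own main class and application JAR):

```bash
# Cluster mode: the driver runs inside an application master on the
# cluster, so this terminal can disconnect after submission.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar

# Client mode: the driver runs inside this spark-submit process on the
# machine you are submitting from.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.MyApp \
  my-app.jar
```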
Cluster mode is used to run production jobs. In client mode, the driver runs locally on the machine from which you submit your application with the spark-submit command. Client mode is mainly used for interactive and debugging purposes.
In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster. The input and output of the application are attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. the Spark shell).
@Avinash A Spark shell is only intended to be used for testing and perhaps development of small applications - it is only an interactive shell and should not be used to run production Spark applications. For production application deployment you should use spark-submit.
Depending on the resource manager, Spark can run in two modes: local mode and cluster mode. The resource manager is specified with a command-line option called --master.
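For example (a sketch; my-app.jar is a hypothetical application JAR):

```bash
# Local mode: run the driver and executors on this machine, using all cores
spark-submit --master "local[*]" my-app.jar

# YARN mode: let the YARN resource manager allocate the cluster resources
spark-submit --master yarn my-app.jar
```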
spark-shell should be used for interactive queries; it needs to be run in yarn-client mode so that the machine you're running it on acts as the driver.
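For instance, a minimal way to start the shell on YARN (the shell's driver has to run where you can type into it, so cluster deploy mode does not apply here):

```bash
# Launch an interactive shell; the driver runs on this machine and
# YARN only provides the executors.
spark-shell --master yarn --deploy-mode client
```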
With spark-submit, you submit a job to the cluster and the job then runs in the cluster. Normally you would run in cluster mode so that YARN can assign the driver to a suitable node on the cluster with available resources.
Some commands (like .collect()) send all the data to the driver node, which can cause significant performance differences depending on whether your driver node is inside the cluster or on a machine outside the cluster (e.g. a user's laptop).
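As a rough sketch of why this matters (piping a throwaway Scala snippet into the shell; the dataset size is arbitrary):

```bash
# collect() ships every element of the RDD back to the driver JVM.
# If the driver is your laptop in client mode, all of this data
# crosses the network to your machine.
spark-shell --master yarn --deploy-mode client <<'EOF'
val data = sc.parallelize(1 to 1000000)
val all  = data.collect()  // entire dataset materialized on the driver
println(all.length)
EOF
```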
Client mode - Use for interactive queries where you want to see the output directly (on a local machine or edge node). This will run the driver on the local machine / edge node from which you launched the application.
Cluster mode - This mode will launch the driver inside the cluster, irrespective of the machine you used to submit the application. YARN will allocate an application master in which the driver is created, and hence the application becomes fault tolerant.
For learning purposes client mode is good enough. In a production environment you should ALWAYS use cluster mode.
Let me explain with the help of an example. Imagine a scenario where you want to launch multiple applications. Say you have a 5-node cluster with nodes A, B, C, D, and E.
The workload will be distributed across all 5 worker nodes, and one node is additionally used to submit jobs (say 'A' is used for this). Now, every time you launch an application in client mode, the driver process always runs on 'A'.
It might work well for a few jobs, but as the jobs keep increasing, 'A' will run short of resources like CPU and memory.
Imagine the impact on a very large cluster which runs multiple such jobs.
But if you choose cluster mode, the drivers will not all run on 'A' every time; they will be distributed across all 5 nodes. The resources in this case are utilized more evenly.
Hope this helps you to decide what mode to choose.