The documentation at https://spark.apache.org/docs/1.1.0/submitting-applications.html describes --deploy-mode as:
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
Using the diagram fig1 (taken from http://spark.apache.org/docs/1.2.0/cluster-overview.html) as a guide:
If I kick off a Spark job:
    ./bin/spark-submit \
      --class com.driver \
      --master spark://MY_MASTER:7077 \
      --executor-memory 845M \
      --deploy-mode client \
      ./bin/Driver.jar
Then the Driver Program will be MY_MASTER, as specified in fig1.
If I instead use --deploy-mode cluster, will the Driver Program be shared among the Worker Nodes? If so, does this mean that the Driver Program box in fig1 can be dropped (as it is no longer used), since the SparkContext will also be shared among the worker nodes?
Under what conditions should cluster be used instead of client?
Cluster mode is typically used to run production jobs. In client mode, the driver runs locally on the machine from which you submit your application with the spark-submit command. Client mode is mainly used for interactive work and debugging.
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
Deploy mode specifies where the driver executes in the deployment environment. It can be one of the following:
client (default) - the driver runs on the machine from which the Spark application was launched.
cluster - the driver runs on a node inside the cluster, chosen by the cluster manager.
In short, Spark has two deploy modes: client mode and cluster mode.
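On the command line, the two modes differ only in the value passed to --deploy-mode. The sketch below wraps the command from the question in a small shell function so the two variants can be compared side by side. It only prints the command instead of running it; `submit_cmd` is a hypothetical helper name, and MY_MASTER, com.driver and ./bin/Driver.jar are the question's own placeholders.

```shell
#!/usr/bin/env bash
# Sketch: the job from the question, parameterised by deploy mode.
# submit_cmd is a hypothetical helper; it prints the command (dry run)
# rather than executing it.

submit_cmd() {
  local mode="${1:-client}"   # spark-submit itself also defaults to client
  case "$mode" in
    client|cluster) ;;        # the only two valid deploy modes
    *) echo "unknown deploy mode: $mode" >&2; return 1 ;;
  esac
  echo ./bin/spark-submit \
    --class com.driver \
    --master spark://MY_MASTER:7077 \
    --executor-memory 845M \
    --deploy-mode "$mode" \
    ./bin/Driver.jar
}

submit_cmd client    # driver would run on the submitting machine
submit_cmd cluster   # driver would run on a node inside the cluster
```

One caveat when switching to cluster mode: because the driver no longer runs on the submitting machine, a local path such as ./bin/Driver.jar will probably not be enough. The Spark documentation asks for the application jar URL to be visible from inside the cluster, for example an hdfs:// path or a file:// path that exists on every node.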
No. When deploy-mode is client, the Driver Program does not necessarily run on the master node. You could run spark-submit on your laptop, and the Driver Program would run on your laptop.
By contrast, when deploy-mode is cluster, the cluster manager (master node) finds a worker with enough available resources to run the Driver Program. As a result, the Driver Program runs on one of the worker nodes. Because its execution is delegated, you cannot read the result directly from the Driver Program on the submitting machine; it must store its results in a file, a database, etc.