 

Understand Spark: Cluster Manager, Master and Driver nodes


Having read this question, I would like to ask additional questions:

  1. The Cluster Manager is a long-running service; on which node does it run?
  2. Is it possible for the Master and the Driver nodes to be the same machine? I presume there should be a rule somewhere stating that these two nodes must be different.
  3. In case the Driver node fails, who is responsible for re-launching the application? And what will happen exactly, i.e. how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?
  4. Similarly to the previous question: in case the Master node fails, what will happen exactly and who is responsible for recovering from the failure?
Rami asked Jan 11 '16


People also ask

What is master and driver in Spark?

Master is per cluster, and Driver is per application. For standalone/yarn clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application.

What is master node and worker node in Spark?

A Worker node is a node that runs application code in the cluster; it is the slave node. The Master node assigns work, and the Worker nodes actually perform the assigned tasks. Each Worker node processes the data stored on it and reports its resources to the Master.

What is the role of master node in Spark?

The Spark Master is the process that requests resources in the cluster and makes them available to the Spark Driver. In all deployment modes, the Master negotiates resources or containers with the Worker (slave) nodes, tracks their status, and monitors their progress.

What is worker node and driver node?

Each Worker node consists of one or more Executors, which are responsible for running Tasks. Executors register themselves with the Driver, so the Driver has all the information about the Executors at all times. This working combination of Driver and Workers is known as a Spark Application.


1 Answer

1. The Cluster Manager is a long-running service; on which node does it run?

In Spark standalone mode, the Cluster Manager is the Master process. It can be started on any node with ./sbin/start-master.sh; on YARN, the Cluster Manager is the Resource Manager.
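For example (hostnames and ports below are placeholders), a standalone Cluster Manager and its Workers are started with the scripts shipped in Spark's sbin directory:

```sh
# On the machine that should act as the Cluster Manager (standalone Master):
./sbin/start-master.sh
# The Master then listens on spark://<master-host>:7077 by default.

# On each machine that should contribute resources, start a Worker
# and point it at the Master URL (the script is start-slave.sh in older releases):
./sbin/start-worker.sh spark://master-host:7077
```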

2. Is it possible for the Master and the Driver nodes to be the same machine? I presume there should be a rule somewhere stating that these two nodes must be different.

Master is per cluster, and Driver is per application. For standalone/yarn clusters, Spark currently supports two deploy modes.

  1. In client mode, the driver is launched in the same process as the client that submits the application.
  2. In cluster mode, however, the driver is launched on one of the Worker nodes for standalone, or inside the Application Master for YARN; the client process exits as soon as it fulfils its responsibility of submitting the application, without waiting for the app to finish.

If an application is submitted with --deploy-mode client on the Master node, both Master and Driver will be on the same node. See the deployment of a Spark application over YARN for details.
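As a rough sketch of the two deploy modes (the master URL, class name and jar are placeholders, not from the original question):

```sh
# Client mode: the driver runs inside this spark-submit process, so
# launching it on the Master node puts Master and Driver on the same machine.
./bin/spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --class com.example.MyApp \
  my-app.jar

# Cluster mode: the driver is launched on one of the Workers (standalone)
# or inside the YARN Application Master, and this client process exits
# once the application has been submitted.
./bin/spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar
```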

3. In case the Driver node fails, who is responsible for re-launching the application? And what will happen exactly, i.e. how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?

If the Driver fails, all executors and their tasks for that submitted/triggered Spark application will be killed.
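Whether anything re-launches the driver depends on how the application was submitted: in standalone (and Mesos) cluster mode, the --supervise flag asks the Master to restart a failed driver, while on YARN the Resource Manager can re-attempt the application up to spark.yarn.maxAppAttempts times. A hedged sketch (class name, jar and master URL are placeholders):

```sh
# Standalone cluster mode: the Master restarts the driver on a Worker
# if it exits with a non-zero exit code.
./bin/spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.MyApp \
  my-app.jar

# YARN cluster mode: the Resource Manager re-launches the Application Master
# (which hosts the driver) up to the configured number of attempts.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=2 \
  --class com.example.MyApp \
  my-app.jar
```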

4. In the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?

Master node failures are handled in two ways.

  1. Standby Masters with ZooKeeper:

    Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected “leader” and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master’s state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects the scheduling of new applications – applications that were already running during Master failover are unaffected. See the standalone mode documentation for the configuration details; a configuration sketch follows this list.

  2. Single-Node Recovery with Local File System:

    ZooKeeper is the best way to go for production-level high availability, but if you just want to be able to restart the Master when it goes down, FILESYSTEM mode can take care of it. When applications and Workers register, they have enough state written to the provided directory that they can be recovered upon a restart of the Master process. See the standalone mode documentation for the configuration details.
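Both recovery modes are configured through the Master's JVM options, typically via SPARK_DAEMON_JAVA_OPTS in conf/spark-env.sh. A minimal sketch, assuming a ZooKeeper ensemble at zk1:2181,zk2:2181,zk3:2181 and a local recovery directory (both are placeholders; enable one mode, not both):

```sh
# conf/spark-env.sh on each standalone Master node

# Option 1: standby Masters with ZooKeeper
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 -Dspark.deploy.zookeeper.dir=/spark"

# Option 2: single-node recovery with the local file system
# export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/var/spark/recovery"
```

With the ZooKeeper mode enabled, start a Master process on each candidate machine pointing at the same ensemble; ZooKeeper elects one as leader and the others stay in standby.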

mrsrinivas answered Oct 06 '22