Having read this question, I would like to ask additional questions:
Master is per cluster, and Driver is per application. For standalone/yarn clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application.
Worker node refers to node which runs the application code in the cluster. Worker Node is the Slave Node. Master node assign work and worker node actually perform the assigned tasks. Worker node processes the data stored on the node, they report the resources to the master.
The Spark Master is the process that requests resources in the cluster and makes them available to the Spark Driver. In all deployment modes, the Master negotiates resources or containers with Worker nodes or slave nodes and tracks their status and monitors their progress.
Each Worker node consists of one or more Executor(s) who are responsible for running the Task. Executors register themselves with Driver. The Driver has all the information about the Executors at all the time. This working combination of Driver and Workers is known as Spark Application.
1. The Cluster Manager is a long-running service, on which node it is running?
Cluster Manager is Master process in Spark standalone mode. It can be started anywhere by doing ./sbin/start-master.sh
, in YARN it would be Resource Manager.
2. Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
Master
is per cluster, and Driver
is per application. For standalone/yarn clusters, Spark currently supports two deploy modes.
If an application submitted with --deploy-mode client
in Master node, both Master and Driver will be on the same node. check deployment of Spark application over YARN
3. In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly? i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?
If the driver fails, all executors tasks will be killed for that submitted/triggered spark application.
4. In the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?
Master node failures are handled in two ways.
Standby Masters with ZooKeeper:
Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected “leader” and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master’s state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling new applications – applications that were already running during Master failover are unaffected. check here for configurations
Single-Node Recovery with Local File System:
ZooKeeper is the best way to go for production-level high availability, but if you want to be able to restart the Master if it goes down, FILESYSTEM mode can take care of it. When applications and Workers register, they have enough state written to the provided directory so that they can be recovered upon a restart of the Master process. check here for conf and more details
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With