 

How to make Spark driver resilient to Master restarts?

I have a Spark Standalone (not YARN/Mesos) cluster and a driver app running in client mode, which talks to that cluster to execute its tasks. However, if I shut down and restart the Spark master and workers, the driver does not reconnect to the master and resume its work.

Perhaps I am confused about the relationship between the Spark Master and the driver. In a situation like this, is the Master responsible for reconnecting back to the driver? If so, does the Master serialize its current state to disk somewhere that it can restore on restart?

dOxxx asked Oct 13 '16


People also ask

Is the driver the master in Spark?

The Apache Spark framework uses a master-slave architecture that consists of a driver, which runs as a master node, and many executors that run across the worker nodes in the cluster. Apache Spark can be used for batch processing as well as real-time processing.

What happens if master node fails in Spark?

So, yes, if the master fails, the executors will be unable to communicate with it and will stop working. The driver will also be unable to reach the master for job status, so your application will fail.

What happens if driver program fails in Spark?

Driver node failure: the driver node's purpose is to run the Spark Streaming application; if it fails, the SparkContext is lost and the executors are unable to access any in-memory data.


1 Answer

In a situation like this, is the Master responsible for reconnecting back to the driver? If so, does the Master serialize its current state to disk somewhere that it can restore on restart?

The relationship between the Master node and the driver depends on a few factors. First, the driver is the one hosting your SparkContext/StreamingContext and is in charge of the job's execution. It is the one that creates the DAG and holds the DAGScheduler and TaskScheduler, which assign stages and tasks respectively. The Master node may also serve as the host for the driver if you use Spark Standalone and submit your job in "client mode" from the Master machine; in that case the Master machine hosts the driver process as well, and if it dies, the driver dies with it. If "cluster mode" is used, the driver resides on one of the Worker nodes and communicates with the Master frequently to get the status of the currently running job, send back metadata about the status of completed batches, and so on.
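
As a rough illustration (the master URL, class name and jar below are placeholders), the deploy mode is chosen when the application is submitted; on Standalone in cluster mode, the --supervise flag additionally tells the Master to restart the driver if it exits with a non-zero code:

    # Client mode: the driver runs inside the process that invokes spark-submit
    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode client \
      --class com.example.MyApp \
      my-app.jar

    # Cluster mode with supervision: the driver is launched on a worker and
    # restarted by the Master if it fails
    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      --supervise \
      --class com.example.MyApp \
      my-app.jar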

Running on Standalone, if the Master dies and you restart it, the Master does not re-execute the jobs that were previously running. To achieve this, you can create and provide the cluster with an additional Master node and set it up so that ZooKeeper holds the Masters' state and switches between the two in case of failure. When you set up the cluster in such a way, the Master knows about its previously executed jobs and resumes them on your behalf once the new Master has taken the lead.

You can read how to create a standby Spark Master node in the documentation.
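
As a minimal sketch of what that setup could look like (the ZooKeeper hosts and directory below are placeholders), each Master is started with recovery mode enabled, typically via conf/spark-env.sh:

    # conf/spark-env.sh on every Master node (hostnames and paths are examples only)
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

Applications and workers are then pointed at the full list of Masters, e.g. spark://master1:7077,master2:7077, so they can register with whichever Master is currently active.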

Yuval Itzchakov answered Oct 11 '22