 

How to make Spark driver resilient to Master restarts?

I have a Spark Standalone (not YARN/Mesos) cluster and a driver app running in client mode, which talks to that cluster to execute its tasks. However, if I shut down and restart the Spark master and workers, the driver does not reconnect to the master and resume its work.

Perhaps I am confused about the relationship between the Spark Master and the driver. In a situation like this, is the Master responsible for reconnecting back to the driver? If so, does the Master serialize its current state to disk somewhere that it can restore on restart?

dOxxx asked Oct 13 '16


People also ask

Is the driver the master in Spark?

The Apache Spark framework uses a master-slave architecture that consists of a driver, which runs as a master node, and many executors that run across the worker nodes in the cluster. Apache Spark can be used for batch processing as well as real-time processing.

What happens if master node fails in Spark?

So, yes, if the master fails, the executors will be unable to communicate with it and will stop working. The driver will also be unable to reach the master for job status, so your application will fail.

What happens if driver program fails in Spark?

Driver node failure: the driver node's purpose is to run the Spark Streaming application; if it fails, the SparkContext is lost and the executors are unable to access any in-memory data.


1 Answer

In a situation like this, is the Master responsible for reconnecting back to the driver? If so, does the Master serialize its current state to disk somewhere that it can restore on restart?

The relationship between the Master node and the driver depends on a few factors. First, the driver is the one hosting your SparkContext/StreamingContext and is in charge of the job's execution. It is the one that creates the DAG and holds the DAGScheduler and TaskScheduler, which assign stages and tasks respectively. The Master node may also serve as the host for the driver if you use Spark Standalone and submit your job in "client mode" from the Master machine; in that case the Master machine hosts the driver process as well, and if it dies, the driver dies with it. If "cluster mode" is used, the driver resides on one of the Worker nodes and communicates with the Master frequently to get the status of the currently running job, send back metadata about the status of completed batches, and so on.
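
As a rough illustration (the master URL, class name and jar below are placeholders), the deploy mode is chosen when the application is submitted; on Standalone in cluster mode, the --supervise flag additionally tells the Master to restart the driver if it exits with a non-zero code:

    # Client mode: the driver runs inside the process that invokes spark-submit
    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode client \
      --class com.example.MyApp \
      my-app.jar

    # Cluster mode with supervision: the driver is launched on a worker and
    # restarted by the Master if it fails
    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      --supervise \
      --class com.example.MyApp \
      my-app.jar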

Running on Standalone, if the Master dies and you restart it, the Master does not re-execute the jobs that were previously running. To achieve this, you can create and provide the cluster with an additional Master node and set it up so that ZooKeeper holds the Masters' state and switches between the two in case of failure. When you set up the cluster in such a way, the Master knows about its previously executed jobs and resumes them on your behalf once the new Master has taken the lead.

You can read how to create a standby Spark Master node in the documentation.
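
As a minimal sketch of what that setup could look like (the ZooKeeper hosts and directory below are placeholders), each Master is started with recovery mode enabled, typically via conf/spark-env.sh:

    # conf/spark-env.sh on every Master node (hostnames and paths are examples only)
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

Applications and workers are then pointed at the full list of Masters, e.g. spark://master1:7077,master2:7077, so they can register with whichever Master is currently active.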

Yuval Itzchakov answered Oct 11 '22