 

How does Apache Spark handle system failure when deployed on YARN?

Preconditions

Let's assume Apache Spark is deployed on a Hadoop cluster using YARN, and a Spark job is running. How does Spark handle the situations listed below?

Cases & Questions

  1. One node of the Hadoop cluster fails due to a disk error. However, replication is high enough and no data is lost.
    • What will happen to the tasks that were running on that node?
  2. One node of the Hadoop cluster fails due to a disk error. Replication was not high enough and data was lost; put simply, Spark can no longer find a file that was pre-configured as a resource for the workflow.
    • How will it handle this situation?
  3. During execution, the primary NameNode fails over.
    • Will Spark automatically use the failover NameNode?
    • What happens when the secondary NameNode fails as well?
  4. For some reason, the cluster is completely shut down during a workflow.
    • Will Spark restart with the cluster automatically?
    • Will it resume from the last "save" point of the workflow?

I know some of these questions might sound odd. Anyway, I hope you can answer some or all of them. Thanks in advance. :)

asked Jul 15 '14 by Matthias Kricke

People also ask

How does Spark work with YARN?

When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same container. This approach makes task startup several orders of magnitude faster.
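As a rough illustration, here is a minimal Scala sketch of a Spark application sized for YARN containers. The application name, HDFS path, and resource values are hypothetical, and the YARN master is assumed to be supplied at launch time (for example via spark-submit with a YARN master):

```scala
import org.apache.spark.sql.SparkSession

object YarnContainerSketch {
  def main(args: Array[String]): Unit = {
    // Each executor requested here runs as a YARN container; many tasks
    // execute inside the same executor JVM rather than one JVM per task.
    val spark = SparkSession.builder()
      .appName("yarn-container-sketch")
      .config("spark.executor.instances", "4") // four executor containers
      .config("spark.executor.cores", "2")     // two concurrent tasks per container
      .config("spark.executor.memory", "2g")   // memory per container
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("hdfs:///data/input.txt")      // hypothetical HDFS path
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1L))
      .reduceByKey(_ + _)

    println(counts.count())
    spark.stop()
  }
}
```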

How does Spark handle node failure?

If the driver node of a Spark Streaming application fails, all the data that was received and replicated in memory will be lost, which affects the results of stateful transformations. To avoid this loss of data, Spark 1.2 introduced write-ahead logs, which save received data to fault-tolerant storage.
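Below is a minimal Scala sketch of how the write-ahead log is typically enabled for a receiver-based stream. The checkpoint directory, host, and port are placeholders, and the snippet assumes the driver itself is restarted by whatever supervises the job:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("wal-sketch")
      // Persist received blocks to a write-ahead log before acknowledging them,
      // so they can be replayed after a driver failure (available since Spark 1.2).
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // The write-ahead log lives under the checkpoint directory, which must be
    // on fault-tolerant storage such as HDFS.
    ssc.checkpoint("hdfs:///checkpoints/wal-sketch") // hypothetical path

    // With the WAL enabled, in-memory replication of received data is redundant,
    // so a single serialized copy is typically sufficient.
    val lines = ssc.socketTextStream("stream-host", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```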

What happens when Spark job fails?

A common example is the FileAlreadyExistsException. When a Spark executor fails, Spark retries the task; after the maximum number of retries, the job can fail with a FileAlreadyExistsException because output left behind by the failed attempts already exists.
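As a hedged sketch of one common mitigation when re-running a job whose earlier attempt left partial output behind, the Scala snippet below writes with overwrite mode and shows the task-retry setting. The paths and the retry value are hypothetical, and this is not a universal fix for every FileAlreadyExistsException:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object RetrySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("retry-sketch")
      // Number of task attempts before the whole stage/job is failed (default is 4).
      .config("spark.task.maxFailures", "4")
      .getOrCreate()

    val df = spark.read.json("hdfs:///data/events.json") // hypothetical input

    // Overwrite mode replaces any output left behind by a previous failed run,
    // avoiding FileAlreadyExistsException when the job is resubmitted.
    df.write
      .mode(SaveMode.Overwrite)
      .parquet("hdfs:///data/events_parquet")            // hypothetical output

    spark.stop()
  }
}
```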

How does Spark ensure fault tolerance?

In Apache Spark, the data storage model is based on RDDs. RDDs achieve fault tolerance through lineage: an RDD always carries the information needed to rebuild itself from other datasets. If any partition of an RDD is lost due to a failure, the lineage allows Spark to recompute only that lost partition.
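To make the lineage idea concrete, here is a small Scala sketch (the input path is hypothetical) that builds a short chain of transformations and prints the lineage Spark would use to recompute a lost partition:

```scala
import org.apache.spark.sql.SparkSession

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lineage-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Each transformation records how its partitions derive from the parent RDD.
    val base     = sc.textFile("hdfs:///data/input.txt") // hypothetical path
    val words    = base.flatMap(_.split("\\s+"))
    val filtered = words.filter(_.nonEmpty)

    // toDebugString prints the lineage graph; if a partition of `filtered` is
    // lost, Spark uses this lineage to recompute only that partition.
    println(filtered.toDebugString)

    spark.stop()
  }
}
```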


1 Answer

Here are the answers given on the mailing list (answers were provided by Sandy Ryza of Cloudera):

  1. "Spark will rerun those tasks on a different node."
  2. "After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
  3. "Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
  4. Restarting is an administration task, and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available." (See the checkpointing sketch below.)
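As a rough sketch of the checkpointing mentioned in point 4, the Scala snippet below writes checkpoint data to HDFS. The directory and input path are hypothetical, and a real workflow would also need logic to detect and reuse existing checkpoint data after a restart:

```scala
import org.apache.spark.sql.SparkSession

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Checkpoint data is written to fault-tolerant storage (HDFS here), so a
    // restarted job can reuse it instead of recomputing the full lineage.
    sc.setCheckpointDir("hdfs:///checkpoints/workflow") // hypothetical path

    val enriched = sc.textFile("hdfs:///data/input.txt") // hypothetical path
      .map(line => line.toLowerCase)

    enriched.checkpoint()      // marks the RDD to be checkpointed at the next action
    println(enriched.count())  // the action triggers the checkpoint write

    spark.stop()
  }
}
```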
answered Oct 21 '22 by Matthias Kricke