 

How does Apache Spark handle system failure when deployed on YARN?

Preconditions

Let's assume Apache Spark is deployed on a Hadoop cluster using YARN, and a Spark job is running. How does Spark handle the situations listed below?

Cases & Questions

  1. One node of the Hadoop cluster fails due to a disk error. However, replication is high enough and no data is lost.
    • What will happen to the tasks that were running on that node?
  2. One node of the Hadoop cluster fails due to a disk error. Replication was not high enough and data was lost; put simply, Spark can no longer find a file that was pre-configured as a resource for the workflow.
    • How will it handle this situation?
  3. During execution, the primary NameNode fails over.
    • Will Spark automatically use the failover NameNode?
    • What happens when the secondary NameNode fails as well?
  4. For some reason, the cluster is completely shut down during a workflow.
    • Will Spark restart with the cluster automatically?
    • Will it resume from the last "save" point of the workflow?

I know some of these questions might sound odd. Anyway, I hope you can answer some or all of them. Thanks in advance. :)

asked Jul 15 '14 by Matthias Kricke

People also ask

How does Spark work with YARN?

When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same container. This approach makes task startup several orders of magnitude faster.
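As a rough illustration, here is a minimal Scala sketch of a Spark application sized for YARN containers. The application name, HDFS path, and resource values are hypothetical, and the YARN master is assumed to be supplied at launch time (for example via spark-submit with a YARN master):

```scala
import org.apache.spark.sql.SparkSession

object YarnContainerSketch {
  def main(args: Array[String]): Unit = {
    // Each executor requested here runs as a YARN container; many tasks
    // execute inside the same executor JVM rather than one JVM per task.
    val spark = SparkSession.builder()
      .appName("yarn-container-sketch")
      .config("spark.executor.instances", "4") // four executor containers
      .config("spark.executor.cores", "2")     // two concurrent tasks per container
      .config("spark.executor.memory", "2g")   // memory per container
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("hdfs:///data/input.txt")      // hypothetical HDFS path
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1L))
      .reduceByKey(_ + _)

    println(counts.count())
    spark.stop()
  }
}
```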

How does Spark handle node failure?

If the driver node of a Spark Streaming application fails, all the data that was received and replicated in memory will be lost, which affects the results of stateful transformations. To avoid this loss of data, Spark 1.2 introduced write-ahead logs, which save received data to fault-tolerant storage.
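Below is a minimal Scala sketch of how the write-ahead log is typically enabled for a receiver-based stream. The checkpoint directory, host, and port are placeholders, and the snippet assumes the driver itself is restarted by whatever supervises the job:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("wal-sketch")
      // Persist received blocks to a write-ahead log before acknowledging them,
      // so they can be replayed after a driver failure (available since Spark 1.2).
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // The write-ahead log lives under the checkpoint directory, which must be
    // on fault-tolerant storage such as HDFS.
    ssc.checkpoint("hdfs:///checkpoints/wal-sketch") // hypothetical path

    // With the WAL enabled, in-memory replication of received data is redundant,
    // so a single serialized copy is typically sufficient.
    val lines = ssc.socketTextStream("stream-host", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```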

What happens when Spark job fails?

A common example is the FileAlreadyExistsException. When a Spark executor fails, Spark retries the task; after the maximum number of retries, the job can fail with a FileAlreadyExistsException because output left behind by the failed attempts already exists.
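As a hedged sketch of one common mitigation when re-running a job whose earlier attempt left partial output behind, the Scala snippet below writes with overwrite mode and shows the task-retry setting. The paths and the retry value are hypothetical, and this is not a universal fix for every FileAlreadyExistsException:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object RetrySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("retry-sketch")
      // Number of task attempts before the whole stage/job is failed (default is 4).
      .config("spark.task.maxFailures", "4")
      .getOrCreate()

    val df = spark.read.json("hdfs:///data/events.json") // hypothetical input

    // Overwrite mode replaces any output left behind by a previous failed run,
    // avoiding FileAlreadyExistsException when the job is resubmitted.
    df.write
      .mode(SaveMode.Overwrite)
      .parquet("hdfs:///data/events_parquet")            // hypothetical output

    spark.stop()
  }
}
```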

How does Spark ensure fault tolerance?

In Apache Spark, the data storage model is based on RDDs. RDDs achieve fault tolerance through lineage: an RDD always carries the information needed to rebuild itself from other datasets. If any partition of an RDD is lost due to a failure, the lineage allows Spark to recompute only that lost partition.
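To make the lineage idea concrete, here is a small Scala sketch (the input path is hypothetical) that builds a short chain of transformations and prints the lineage Spark would use to recompute a lost partition:

```scala
import org.apache.spark.sql.SparkSession

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lineage-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Each transformation records how its partitions derive from the parent RDD.
    val base     = sc.textFile("hdfs:///data/input.txt") // hypothetical path
    val words    = base.flatMap(_.split("\\s+"))
    val filtered = words.filter(_.nonEmpty)

    // toDebugString prints the lineage graph; if a partition of `filtered` is
    // lost, Spark uses this lineage to recompute only that partition.
    println(filtered.toDebugString)

    spark.stop()
  }
}
```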


1 Answer

Here are the answers given on the mailing list (answers were provided by Sandy Ryza of Cloudera):

  1. "Spark will rerun those tasks on a different node."
  2. "After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
  3. "Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
  4. Restarting is an administration task, and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available." (See the checkpointing sketch below.)
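As a rough sketch of the checkpointing mentioned in point 4, the Scala snippet below writes checkpoint data to HDFS. The directory and input path are hypothetical, and a real workflow would also need logic to detect and reuse existing checkpoint data after a restart:

```scala
import org.apache.spark.sql.SparkSession

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Checkpoint data is written to fault-tolerant storage (HDFS here), so a
    // restarted job can reuse it instead of recomputing the full lineage.
    sc.setCheckpointDir("hdfs:///checkpoints/workflow") // hypothetical path

    val enriched = sc.textFile("hdfs:///data/input.txt") // hypothetical path
      .map(line => line.toLowerCase)

    enriched.checkpoint()      // marks the RDD to be checkpointed at the next action
    println(enriched.count())  // the action triggers the checkpoint write

    spark.stop()
  }
}
```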
answered Oct 21 '22 by Matthias Kricke