 

How does Spark's RDD achieve fault tolerance?

Tags:

apache-spark

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. However, I could not find out how an RDD achieves fault tolerance internally. Could somebody describe this mechanism? Thanks.

Ivan Lee asked Aug 28 '16

People also ask

How does RDD achieve fault tolerance?

RDDs achieve fault tolerance through lineage. An RDD always has the information on how it was built from other datasets. If any partition of an RDD is lost due to a failure, lineage makes it possible to rebuild only that particular lost partition.

Is Spark RDD fault tolerant?

Spark operates on data in fault-tolerant file systems like HDFS or S3, so all RDDs generated from fault-tolerant data are themselves fault tolerant. This does not hold true for streaming/live data (data over the network), so the key need for fault tolerance in Spark is for this kind of data.

What feature in RDD brings about the fault tolerance in Spark?

A key feature of RDDs is that they can be reconstructed if an RDD partition is lost, using a concept called lineage, and thus can be considered fault tolerant. Spark keeps a record of the lineage of an RDD by tracking the transformations that have been performed to create it.

How is fault tolerance achieved?

Fault-tolerance can be achieved using both software-based and hardware-based approaches. In a software-based approach, all data committed to disk is mirrored across redundant systems. More sophisticated software-based approaches also replicate uncommitted data, or data in memory, to a redundant system.


1 Answer

Let me explain in very simple terms as I understand.

Faults in a cluster can happen when one of the nodes processing data crashes. In Spark terms, an RDD is split into partitions, and each node (called an executor) operates on a partition at any point in time. (Theoretically, each executor can be assigned multiple tasks, depending on the number of cores assigned to the job versus the number of partitions in the RDD.)
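As a minimal sketch (assuming an existing `SparkContext` named `sc`, e.g. from a `spark-shell` session), the number of partitions determines how many tasks each stage is split into:

```scala
// Minimal sketch: an RDD split into 8 partitions; each partition is processed
// by one task, and tasks are distributed across the executors.
// Assumes an existing SparkContext `sc` (e.g. from spark-shell).
val rdd = sc.parallelize(1 to 1000000, numSlices = 8)
println(rdd.getNumPartitions) // => 8, i.e. 8 tasks per stage over this RDD
```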

By "operation", what is really happening is a series of Scala functions (called transformations and actions in Spark terms, depending on whether the function is pure or side-effecting) executing on a partition of the RDD. These operations are composed together, and the Spark execution engine views them as a Directed Acyclic Graph (DAG) of operations.
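Here is a small Scala sketch of such a composition (the HDFS path is hypothetical; `sc` is an existing `SparkContext`). The first three lines only build up the lineage/DAG; nothing runs until the action at the end:

```scala
// Transformations are lazy and only record lineage; the action triggers the DAG.
// Assumes an existing SparkContext `sc`; the HDFS path is hypothetical.
val x = sc.textFile("hdfs:///data/input.txt")        // operation X
val y = x.flatMap(_.split("\\s+"))                   // operation Y (transformation)
val z = y.map(word => (word, 1)).reduceByKey(_ + _)  // operation Z (transformations)
val counts = z.collect()                             // action: executes the whole DAG
```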

Now, suppose a particular node crashes in the middle of operation Z, which depended on operation Y, which in turn depended on operation X. The cluster manager (YARN/Mesos) finds out that the node is dead and tries to assign another node to continue processing. This node will be told to operate on the particular partition of the RDD and to execute the series of operations X -> Y -> Z (called the lineage), by passing in the Scala closures created from the application code. The new node can then happily continue processing, and there is effectively no data loss.
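You can inspect the recorded lineage yourself with `RDD.toDebugString` (continuing the `z` RDD from the sketch above); Spark replays exactly this chain of dependencies to rebuild a lost partition. As an optional aside, persisting or checkpointing shortens what has to be replayed (the checkpoint directory below is hypothetical):

```scala
// Inspect the lineage Spark would replay to recompute a lost partition.
println(z.toDebugString)

// Optional: shorten the lineage that must be replayed after a failure.
z.persist()                                     // cache computed partitions
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical directory
z.checkpoint()                                  // truncates lineage after the next action on z
```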

Spark also uses this mechanism to guarantee exactly-once processing, with the caveat that any side-effecting operation you perform, like calling a database inside a Spark action block, can be invoked multiple times. But if you view your transformations as pure functional mappings from one RDD to another, then you can rest assured that the resulting RDD will have the elements from the source RDD processed only once.
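A sketch of that distinction, continuing from the RDD `z` above (`writeToDatabase` is a hypothetical stand-in for a real database call):

```scala
// Hypothetical stand-in for a real database write.
def writeToDatabase(word: String, count: Int): Unit = println(s"$word -> $count")

// Pure transformation: safe for Spark to recompute on failure, so each source
// element is reflected exactly once in the resulting RDD.
val processed = z.mapValues(_ * 2)

// Side-effecting action: if a task is retried after a failure, this block may run
// more than once for the same partition, so the write should be idempotent.
processed.foreachPartition { iter =>
  iter.foreach { case (word, count) => writeToDatabase(word, count) }
}
```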

The domain of fault tolerance in Spark is vast and needs a much longer explanation. I am hoping to see others come up with technical details on how this is implemented, etc. Thanks for the great topic, though.

Ramkumar Venkataraman answered Sep 21 '22