Misunderstanding of spark RDD fault tolerant

Many say:

Spark does not replicate data the way HDFS does.

Spark arranges the operations in a DAG and builds RDD lineage. If an RDD is lost, it can be rebuilt with the help of the lineage graph, so there is no need for data replication, as the RDDs can be recomputed from lineage.
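
To make the lineage claim concrete, here is a minimal Scala sketch (assuming a live SparkContext `sc`, as in spark-shell): the transformations below are only recorded in the DAG, and `toDebugString` prints the lineage graph Spark would replay to rebuild a lost partition.

    // Assumes an existing SparkContext `sc` (e.g. from spark-shell).
    val nums    = sc.parallelize(1 to 1000000)   // source RDD
    val doubled = nums.map(_ * 2)                // transformation: recorded, not executed
    val evens   = doubled.filter(_ % 4 == 0)     // another recorded transformation

    // Print the lineage graph Spark keeps for rebuilding lost partitions.
    println(evens.toDebugString)

    // Only an action triggers actual computation.
    println(evens.count())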

And my question is:

If a node fails, Spark will only recompute the RDD partitions lost on that node, but where does the data needed for the recomputation come from? Do you mean the parent RDD is still there when the node fails? And what if the RDD that lost partitions has no parent RDD (for example, an RDD produced by a Spark Streaming receiver)?

asked Oct 30 '22 by Gary Gauh


1 Answer

What if we lose something part way through computation?

  • Rely on the key insight from MapReduce: determinism makes recomputation safe.
  • Track 'lineage' of each RDD. Can recompute from parents if needed.
  • Interesting: only need to record tiny state to do recompute.

    Need parent pointer, function applied, and a few other bits.
    Log 10 KB per transform rather than re-output 1 TB -> 2 TB
    

Source
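
The question's "where does the data come from" point is worth making concrete: lineage bottoms out at a source RDD whose input is still readable, e.g. a file in HDFS, where HDFS's own replication keeps the blocks available. A hedged sketch, with `hdfs:///data/events.log` standing in as a hypothetical path:

    // Hypothetical input path; HDFS replication keeps the source blocks available.
    val lines  = sc.textFile("hdfs:///data/events.log")         // source RDD, re-readable
    val parsed = lines.map(_.split(",")).filter(_.length == 3)  // deterministic transforms

    // If a node holding some `parsed` partitions dies, Spark re-reads only the
    // corresponding input splits from HDFS and re-applies map/filter; the
    // "tiny state" above is just these parent pointers and closures.
    parsed.count()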

A child RDD's dependency is metadata that describes how to compute it from its parent RDD. Read more in What is RDD dependency in Spark?
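
To see that metadata directly, you can inspect an RDD's `dependencies`: a narrow dependency (e.g. from `map`) points to a single parent partition, while a shuffle dependency (e.g. from `reduceByKey`) can point to many. A small sketch, again assuming a live SparkContext `sc`:

    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val mapped  = pairs.mapValues(_ + 1)      // narrow: OneToOneDependency
    val reduced = mapped.reduceByKey(_ + _)   // wide: ShuffleDependency

    println(mapped.dependencies)   // e.g. List(org.apache.spark.OneToOneDependency@...)
    println(reduced.dependencies)  // e.g. List(org.apache.spark.ShuffleDependency@...)

Narrow dependencies are what make lost-partition recovery cheap: only the affected parent partitions need to be recomputed, not the whole RDD.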

answered Nov 15 '22 by gsamaras