Misunderstanding of spark RDD fault tolerant

Many say:

Spark does not replicate data the way HDFS does.

Spark arranges the operations in a DAG and builds RDD lineage. If an RDD is lost, it can be rebuilt with the help of the lineage graph, so there is no need for data replication, as the RDDs can be recomputed from lineage.
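
To make the lineage claim concrete, here is a minimal Scala sketch (assuming a live SparkContext `sc`, as in spark-shell): the transformations below are only recorded in the DAG, and `toDebugString` prints the lineage graph Spark would replay to rebuild a lost partition.

    // Assumes an existing SparkContext `sc` (e.g. from spark-shell).
    val nums    = sc.parallelize(1 to 1000000)   // source RDD
    val doubled = nums.map(_ * 2)                // transformation: recorded, not executed
    val evens   = doubled.filter(_ % 4 == 0)     // another recorded transformation

    // Print the lineage graph Spark keeps for rebuilding lost partitions.
    println(evens.toDebugString)

    // Only an action triggers actual computation.
    println(evens.count())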

And my question is:

If a node fails, Spark will only recompute the RDD partitions lost on that node, but where does the data needed for the recomputation come from? Do you mean the parent RDD is still there when the node fails? And what if the RDD that lost partitions has no parent RDD (for example, an RDD produced by a Spark Streaming receiver)?

asked Oct 30 '22 by Gary Gauh


1 Answer

What if we lose something part way through computation?

  • Rely on the key insight from MapReduce: determinism makes recomputation safe.
  • Track 'lineage' of each RDD. Can recompute from parents if needed.
  • Interesting: only need to record tiny state to do recompute.

    Need parent pointer, function applied, and a few other bits.
    Log 10 KB per transform rather than re-output 1 TB -> 2 TB
    

Source
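
The question's "where does the data come from" point is worth making concrete: lineage bottoms out at a source RDD whose input is still readable, e.g. a file in HDFS, where HDFS's own replication keeps the blocks available. A hedged sketch, with `hdfs:///data/events.log` standing in as a hypothetical path:

    // Hypothetical input path; HDFS replication keeps the source blocks available.
    val lines  = sc.textFile("hdfs:///data/events.log")         // source RDD, re-readable
    val parsed = lines.map(_.split(",")).filter(_.length == 3)  // deterministic transforms

    // If a node holding some `parsed` partitions dies, Spark re-reads only the
    // corresponding input splits from HDFS and re-applies map/filter; the
    // "tiny state" above is just these parent pointers and closures.
    parsed.count()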

A child RDD's dependency is metadata that describes how to compute it from its parent RDD. Read more in What is RDD dependency in Spark?
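
To see that metadata directly, you can inspect an RDD's `dependencies`: a narrow dependency (e.g. from `map`) points to a single parent partition, while a shuffle dependency (e.g. from `reduceByKey`) can point to many. A small sketch, again assuming a live SparkContext `sc`:

    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val mapped  = pairs.mapValues(_ + 1)      // narrow: OneToOneDependency
    val reduced = mapped.reduceByKey(_ + _)   // wide: ShuffleDependency

    println(mapped.dependencies)   // e.g. List(org.apache.spark.OneToOneDependency@...)
    println(reduced.dependencies)  // e.g. List(org.apache.spark.ShuffleDependency@...)

Narrow dependencies are what make lost-partition recovery cheap: only the affected parent partitions need to be recomputed, not the whole RDD.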

answered Nov 15 '22 by gsamaras