Many say:
Spark does not replicate data in hdfs.
Spark arranges the operations in DAG graph.Spark builds RDD lineage. If a RDD is lost they can be rebuilt with the help of lineage graph. So there is no need of data replication as the RDDS can be recalculated from the lineage graph.
And my question is:
If a node fails, spark will only recompute the RDD partitions lost on this node, but where does the data source needed in the recompution process come from ? Do you mean its parent RDD is still there when the node fails?What if the RDD that lost some partitions didn't have parent RDD(like the RDD is from spark streaming receiver) ?
What if we lose something part way through computation?
Interesting: only need to record tiny state to do recompute.
Need parent pointer, function applied, and a few other bits.
Log 10 KB per transform rather than re-output 1 TB -> 2 TB
Source
The child RDD is metadata that describes how to calculate the RDD from the parent RDD. Read more in What is RDD dependency in Spark?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With