If a partition is lost, we can use lineage to reconstruct it. Will the base RDD be loaded again?

I read the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". The authors say that if a partition is lost, we can use lineage to reconstruct it. However, the original RDD may no longer exist in memory at that point. So will the base RDD be loaded again to rebuild the lost RDD partition?

asked Aug 03 '15 by zickr sivolin

People also ask

What happens if RDD partition is lost due to worker node failure?

If any partition of an RDD is lost due to a worker node failure, that partition can be recomputed from the original fault-tolerant dataset using the lineage of operations.

In which way the Spark reconstruct the lost partition in memory?

Lineage is the mechanism an RDD uses to reconstruct lost partitions. Spark does not replicate data in memory; if data is lost, the RDD uses its lineage to rebuild it. Each RDD remembers how it was built from other datasets.

Which feature of RDD is used for rebuilding any potential data losses?

Spark RDDs are fault tolerant because they track lineage information and can rebuild lost data automatically on failure. Each RDD remembers how it was created from other datasets (by transformations like map, join, or groupBy) and uses that record to recreate itself.
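
For illustration, here is a minimal sketch in Scala (Spark's RDD API) showing how each transformation records its parent, and how toDebugString prints the lineage Spark would replay to rebuild a lost partition. The input path is a hypothetical placeholder.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("lineage-demo").setMaster("local[*]"))

    // Each transformation returns a new RDD that remembers its parent,
    // so Spark can replay the chain to rebuild any lost partition.
    val base   = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical path
    val words  = base.flatMap(_.split("\\s+"))
    val counts = words.map((_, 1)).reduceByKey(_ + _)

    // Prints the dependency chain Spark keeps for fault recovery.
    println(counts.toDebugString)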

What makes RDD resilient?

RDD stands for Resilient Distributed Dataset. Resilient because RDDs are immutable (they can't be modified once created) and fault tolerant; Distributed because the data is partitioned across a cluster; Dataset because it holds data.


1 Answer

Yes, as you mentioned, if the RDD that was used to create the partition is no longer in memory, it has to be loaded again from disk and recomputed. If the RDD that was used to create your current partition also isn't there (neither in memory nor on disk), then Spark has to go one step further back and recompute the previous RDD. In the worst case, Spark has to go all the way back to the original data.
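
One way to shorten how far back Spark has to walk is to persist an intermediate RDD explicitly. A minimal sketch, with illustrative names and a hypothetical path:

    import org.apache.spark.storage.StorageLevel

    val parsed = sc.textFile("hdfs:///tmp/events.log")   // hypothetical path
      .map(_.split(","))

    // MEMORY_AND_DISK spills partitions to local disk under memory
    // pressure, so a partition evicted from memory can often be read
    // back from disk instead of being recomputed from the source.
    parsed.persist(StorageLevel.MEMORY_AND_DISK)

    val result = parsed.filter(_.length > 2).count()

Note that persisted data lives on the executors, so it does not survive the loss of a worker node; for that, checkpointing (below) is the tool.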

Long lineage chains like the worst case described above can mean long recomputation times. That is when you should consider checkpointing, which stores intermediate results in reliable storage (like HDFS). Checkpointing prevents Spark from going all the way back to the original data source; it uses the checkpointed data instead.
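
A minimal checkpointing sketch, assuming an HDFS checkpoint directory is available (the path and the artificial loop are illustrative):

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // hypothetical dir

    var rdd = sc.parallelize(1 to 1000000)
    for (_ <- 1 to 100) rdd = rdd.map(_ + 1)   // artificially long lineage

    rdd.checkpoint()   // must be called before the first action on this RDD
    rdd.count()        // the first action materializes the checkpoint

    // The lineage is now truncated: recovery reads the checkpointed data
    // from HDFS instead of replaying the 100 map steps.
    println(rdd.toDebugString)

Spark's documentation recommends persisting an RDD before checkpointing it, since the checkpoint otherwise triggers a second computation of the RDD.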

@Comment: I'm having trouble finding official reference material, but from what I remember of the codebase, Spark only recreates the part of the data that was lost.

answered Sep 22 '22 by Mateusz Dymczyk