If a partition is lost, we can use lineage to reconstruct it. Will the base RDD be loaded again?

I read the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". The authors say that if a partition is lost, we can use lineage to reconstruct it. However, the original RDD may no longer exist in memory at that point. So will the base RDD be loaded again to rebuild the lost RDD partition?

asked Aug 03 '15 by zickr sivolin

People also ask

What happens if RDD partition is lost due to worker node failure?

If any partition of an RDD is lost due to a worker node failure, that partition can be recomputed from the original fault-tolerant dataset using the lineage of operations.

In which way the Spark reconstruct the lost partition in memory?

Lineage is the mechanism an RDD uses to reconstruct lost partitions. Spark does not replicate data in memory; if data is lost, the RDD uses its lineage to rebuild it. Each RDD remembers how it was built from other datasets.

Which feature of RDD is used for rebuilding any potential data losses?

Spark RDDs are fault tolerant because they track lineage information and can rebuild lost data automatically on failure. Each RDD remembers how it was created from other datasets (by transformations like map, join, or groupBy) and uses that record to recreate itself.
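
For illustration, here is a minimal sketch in Scala (Spark's RDD API) showing how each transformation records its parent, and how toDebugString prints the lineage Spark would replay to rebuild a lost partition. The input path is a hypothetical placeholder.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("lineage-demo").setMaster("local[*]"))

    // Each transformation returns a new RDD that remembers its parent,
    // so Spark can replay the chain to rebuild any lost partition.
    val base   = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical path
    val words  = base.flatMap(_.split("\\s+"))
    val counts = words.map((_, 1)).reduceByKey(_ + _)

    // Prints the dependency chain Spark keeps for fault recovery.
    println(counts.toDebugString)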

What makes RDD resilient?

RDD stands for Resilient Distributed Dataset. Resilient because RDDs are immutable (they can't be modified once created) and fault tolerant; Distributed because the data is partitioned across a cluster; Dataset because it holds data.


1 Answer

Yes, as you mentioned, if the RDD that was used to create the partition is no longer in memory, it has to be loaded again from disk and recomputed. If the RDD that was used to create your current partition also isn't there (neither in memory nor on disk), then Spark has to go one step further back and recompute the previous RDD. In the worst case, Spark has to go all the way back to the original data.
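
One way to shorten how far back Spark has to walk is to persist an intermediate RDD explicitly. A minimal sketch, with illustrative names and a hypothetical path:

    import org.apache.spark.storage.StorageLevel

    val parsed = sc.textFile("hdfs:///tmp/events.log")   // hypothetical path
      .map(_.split(","))

    // MEMORY_AND_DISK spills partitions to local disk under memory
    // pressure, so a partition evicted from memory can often be read
    // back from disk instead of being recomputed from the source.
    parsed.persist(StorageLevel.MEMORY_AND_DISK)

    val result = parsed.filter(_.length > 2).count()

Note that persisted data lives on the executors, so it does not survive the loss of a worker node; for that, checkpointing (below) is the tool.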

Long lineage chains like the worst case described above can mean long recomputation times. That is when you should consider checkpointing, which stores intermediate results in reliable storage (like HDFS). Checkpointing prevents Spark from going all the way back to the original data source; it uses the checkpointed data instead.
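
A minimal checkpointing sketch, assuming an HDFS checkpoint directory is available (the path and the artificial loop are illustrative):

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // hypothetical dir

    var rdd = sc.parallelize(1 to 1000000)
    for (_ <- 1 to 100) rdd = rdd.map(_ + 1)   // artificially long lineage

    rdd.checkpoint()   // must be called before the first action on this RDD
    rdd.count()        // the first action materializes the checkpoint

    // The lineage is now truncated: recovery reads the checkpointed data
    // from HDFS instead of replaying the 100 map steps.
    println(rdd.toDebugString)

Spark's documentation recommends persisting an RDD before checkpointing it, since the checkpoint otherwise triggers a second computation of the RDD.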

@Comment: I'm having trouble finding official reference material, but from what I remember of the codebase, Spark only recreates the part of the data that was lost.

answered Sep 22 '22 by Mateusz Dymczyk