What is the difference between a Spark checkpoint and persisting to disk? Are both of these stored on the local disk?



2 Answers

There are a few important differences, but the fundamental one is what happens with lineage. Persist / cache keeps the lineage intact, while checkpoint breaks it. Let's consider the following examples:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 10).map(x => (x % 3, 1)).reduceByKey(_ + _)

cache / persist:

    val indCache = rdd.mapValues(_ > 4)
    indCache.persist(StorageLevel.DISK_ONLY)

    indCache.toDebugString
    // (8) MapPartitionsRDD[13] at mapValues at <console>:24 [Disk Serialized 1x Replicated]
    //  |  ShuffledRDD[3] at reduceByKey at <console>:21 [Disk Serialized 1x Replicated]
    //  +-(8) MapPartitionsRDD[2] at map at <console>:21 [Disk Serialized 1x Replicated]
    //     |  ParallelCollectionRDD[1] at parallelize at <console>:21 [Disk Serialized 1x Replicated]

    indCache.count
    // 3

    indCache.toDebugString
    // (8) MapPartitionsRDD[13] at mapValues at <console>:24 [Disk Serialized 1x Replicated]
    //  |       CachedPartitions: 8; MemorySize: 0.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 587.0 B
    //  |  ShuffledRDD[3] at reduceByKey at <console>:21 [Disk Serialized 1x Replicated]
    //  +-(8) MapPartitionsRDD[2] at map at <console>:21 [Disk Serialized 1x Replicated]
    //     |  ParallelCollectionRDD[1] at parallelize at <console>:21 [Disk Serialized 1x Replicated]

checkpoint:

    // Assumes a checkpoint directory was already set with sc.setCheckpointDir(...)
    val indChk = rdd.mapValues(_ > 4)
    indChk.checkpoint

    indChk.toDebugString
    // (8) MapPartitionsRDD[11] at mapValues at <console>:24 []
    //  |  ShuffledRDD[3] at reduceByKey at <console>:21 []
    //  +-(8) MapPartitionsRDD[2] at map at <console>:21 []
    //     |  ParallelCollectionRDD[1] at parallelize at <console>:21 []

    indChk.count
    // 3

    indChk.toDebugString
    // (8) MapPartitionsRDD[11] at mapValues at <console>:24 []
    //  |  ReliableCheckpointRDD[12] at count at <console>:27 []

As you can see, in the first case the lineage is preserved even when data is fetched from the cache. This means the data can be recomputed from scratch if some partitions of indCache are lost. In the second case the lineage is completely lost after the checkpoint, and indChk no longer carries the information required to rebuild it.
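
If you'd rather check this programmatically than read toDebugString, here is a minimal sketch that looks at the direct parent of each RDD after an action has run. The local master, the app name and the temp checkpoint directory are my own assumptions for the demo, not part of the original example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import java.nio.file.Files

// Assumption: a local context is enough here; in spark-shell this reuses the existing one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lineage-sketch"))
// Hypothetical scratch location for the checkpoint files.
sc.setCheckpointDir(Files.createTempDirectory("chk-lineage").toString)

val rdd = sc.parallelize(1 to 10).map(x => (x % 3, 1)).reduceByKey(_ + _)

val indCache = rdd.mapValues(_ > 4).persist(StorageLevel.DISK_ONLY)
val indChk = rdd.mapValues(_ > 4)
indChk.checkpoint()

// Neither persist nor checkpoint does anything until an action runs.
indCache.count()
indChk.count()

// Class name of the direct parent of each RDD.
val cacheParent = indCache.dependencies.head.rdd.getClass.getSimpleName
val chkParent = indChk.dependencies.head.rdd.getClass.getSimpleName

println(s"persisted parent:    $cacheParent")
println(s"checkpointed parent: $chkParent")
```

The persisted RDD should still point at the ShuffledRDD it was derived from, while the checkpointed one should point at a ReliableCheckpointRDD, matching the debug output above.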

checkpoint, unlike cache / persist, is computed separately from other jobs. That's why an RDD marked for checkpointing should be cached. As the RDD.checkpoint documentation puts it:

It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
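
To see the recomputation for yourself, here is a rough sketch that counts how many times the map function runs, with and without caching before the checkpoint. The helper name evaluationsWithCheckpoint, the accumulator name and the temp directory are made up for illustration, and it assumes a plain local run with no task retries.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import java.nio.file.Files

// Assumption: local mode; in spark-shell this reuses the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("checkpoint-recompute-sketch"))
// Hypothetical scratch location for the checkpoint files.
sc.setCheckpointDir(Files.createTempDirectory("chk-recompute").toString)

// Hypothetical helper: returns how many elements went through the map
// function across the action and the checkpoint job that follows it.
def evaluationsWithCheckpoint(cacheFirst: Boolean): Long = {
  val calls = sc.longAccumulator("map calls")
  val rdd = sc.parallelize(1 to 10, 2).map { x => calls.add(1); x * 2 }
  if (cacheFirst) rdd.cache()
  rdd.checkpoint()
  rdd.count() // the action; the checkpoint is written by a separate job right after
  calls.sum
}

val withoutCache = evaluationsWithCheckpoint(cacheFirst = false)
val withCache = evaluationsWithCheckpoint(cacheFirst = true)
println(s"map calls without cache: $withoutCache, with cache: $withCache")
```

Without the cache the ten elements are mapped twice (once for count, once for the checkpoint job); with the cache the checkpoint job reads the cached partitions instead.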

Finally, checkpointed data is persistent and is not removed after the SparkContext is destroyed.

Regarding data storage: SparkContext.setCheckpointDir, used by RDD.checkpoint, requires a DFS path when running in non-local mode. Otherwise it can be a local file system path as well. localCheckpoint and persist without replication use storage local to the executors (the local file system, for disk-based storage levels).
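
A small sketch of the two flavours side by side, assuming local mode so that a temp directory is acceptable as the checkpoint dir (on a cluster you would pass a DFS path instead). The directory and app name are placeholders of mine.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import java.nio.file.Files

// Assumption: local mode; in spark-shell this reuses the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("checkpoint-storage-sketch"))

// Reliable checkpoint: written under the directory set here.
// Hypothetical temp directory, fine only because this runs in local mode.
sc.setCheckpointDir(Files.createTempDirectory("chk-storage").toString)
val reliable = sc.parallelize(1 to 100).map(_ + 1)
reliable.checkpoint()
reliable.count()

// Local checkpoint: kept in executor storage, no checkpoint directory involved.
val local = sc.parallelize(1 to 100).map(_ + 1)
local.localCheckpoint()
local.count()

println(s"reliable: checkpointed=${reliable.isCheckpointed}, file=${reliable.getCheckpointFile}")
println(s"local:    checkpointed=${local.isCheckpointed}, file=${local.getCheckpointFile}")
```

Both report as checkpointed, but only the reliable one has a checkpoint file path. The local variant trades fault tolerance for speed, since its data goes away with the executors.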

Important Note:

RDD checkpointing is a different concept from checkpointing in Spark Streaming. The former is designed to address the lineage issue; the latter is all about streaming reliability and failure recovery.


answered Oct 03 '22 07:10

zero323

I think you can find a very detailed answer here

While it is hard to summarize everything on that page, here are the main points.

Persist

Persisting or caching with StorageLevel.DISK_ONLY causes the RDD to be computed and stored, so that subsequent uses of that RDD do not recompute the lineage beyond that point.
After persist is called, Spark still remembers the lineage of the RDD even though it doesn't use it.
After the application terminates, the cache is cleared and the files are destroyed.
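
The points above can be sketched roughly like this. The RDD and app name are made up for the example, and a local context is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: local mode; in spark-shell this reuses the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("persist-sketch"))

val doubled = sc.parallelize(1 to 100).map(_ * 2).persist(StorageLevel.DISK_ONLY)
val firstCount = doubled.count()                        // computes the RDD and stores its blocks on disk
val tracked = sc.getPersistentRDDs.contains(doubled.id) // Spark tracks it as a persistent RDD

doubled.unpersist(blocking = true)                      // drops the stored blocks
val levelAfter = doubled.getStorageLevel                // storage level is reset
val secondCount = doubled.count()                       // still works: recomputed from the lineage

println(s"tracked=$tracked, levelAfter=$levelAfter, counts=$firstCount/$secondCount")
```

Because the lineage is kept, dropping the stored blocks (explicitly here, or implicitly when the application ends) never makes the RDD unusable; it only means the next action recomputes it.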

Checkpointing

Checkpointing stores the RDD physically in the checkpoint directory (typically on HDFS) and destroys the lineage that created it.
The checkpoint files won't be deleted even after the Spark application terminates.
Checkpoint files can be used in a subsequent job run or driver program.
Checkpointing an RDD can cause double computation: the action computes the RDD once, and then a separate job computes it again to write it to the checkpoint directory, unless the RDD was cached first.
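
As a quick sanity check that the files outlive the context, here is a sketch that checkpoints an RDD, stops the SparkContext, and then lists what is left in the checkpoint directory. The temp directory stands in for a real checkpoint location, and it assumes spark.cleaner.referenceTracking.cleanCheckpoints is left at its default of false.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

// Assumption: local mode; in spark-shell this reuses the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("checkpoint-files-sketch"))

// Hypothetical scratch directory standing in for a real checkpoint location.
val checkpointRoot: Path = Files.createTempDirectory("chk-files")
sc.setCheckpointDir(checkpointRoot.toString)

val rdd = sc.parallelize(1 to 100, 4).map(_ * 2)
rdd.checkpoint()
rdd.count()

// This stops the context, so don't paste it into a shell session you still need.
sc.stop()

// Collect the partition files left behind under the checkpoint directory.
val stream = Files.walk(checkpointRoot)
val partFiles =
  try stream.iterator().asScala.map(_.getFileName.toString).filter(_.startsWith("part-")).toList
  finally stream.close()

println(s"partition files still on disk: ${partFiles.size}")
```

Cleaning these up is your job: nothing removes them when the application ends.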

You may want to read the article for more details on the internals of Spark's checkpointing and cache operations.

answered Oct 03 '22 08:10

okmich


Tags:

apache-spark

nagendra
