Spark RDD checkpoint on persisted/cached RDDs performs the DAG twice

When I run code such as the following:

val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER)
newRDD.checkpoint
print(newRDD.count())

and watch the stages in YARN, I notice that Spark is doing the DAG calculation TWICE -- once for the distinct+count that materializes the RDD and caches it, and then a completely SECOND time to create the checkpointed copy.

Since the RDD is already materialized and cached, why doesn't the checkpointing simply take advantage of this, and save the cached partitions to disk?

Is there an existing way (some kind of configuration setting or code change) to force Spark to take advantage of this and only run the operation ONCE, and checkpointing will just copy things?

Do I need to "materialize" twice, instead?

val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER)
print(newRDD.count())

newRDD.checkpoint
print(newRDD.count())
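
For reference, here is the fuller form of that second snippet I'm describing (assuming prevRDD and a SparkContext sc as above; the checkpoint directory path is just an example). Checkpointing needs a checkpoint directory, and the checkpoint itself is written by a separate job launched after the next action completes -- which is why materializing and caching first should let that job read the cached partitions rather than recomputing the lineage:

import org.apache.spark.storage.StorageLevel

// Required before checkpoint(); the path here is only an example.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val newRDD = prevRDD.map(a => (a._1, 1L))
  .distinct
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

// First action: computes the DAG once and fills the cache.
print(newRDD.count())

// Mark the RDD for checkpointing.
newRDD.checkpoint()

// Second action: after this job finishes, Spark launches the checkpoint job,
// which can read the already-cached partitions instead of recomputing them.
print(newRDD.count())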

I've created an Apache Spark Jira ticket to make this a feature request: https://issues.apache.org/jira/browse/SPARK-8666

asked Jun 26 '15 by Glenn Strycker

People also ask

Does Spark reuse RDDs?

Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures. A second abstraction in Spark is shared variables that can be used in parallel operations.

What is the difference between persist and cache in Spark?

Spark Cache vs Persist: both caching and persisting are used to save a Spark RDD, DataFrame, or Dataset. The difference is that cache() stores it at the default storage level (MEMORY_ONLY), whereas persist() lets you store it at a user-defined storage level (a short sketch follows this section).

Why do we use persist() on an RDD?

When we persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in later operations on that data. This can speed up subsequent computations considerably, often by an order of magnitude. Once the RDD has been computed for the first time, it is kept in memory on the node.
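
As a quick illustration of the point above (the input path and variable names are illustrative, assuming an existing SparkContext sc): cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), while persist() accepts an explicit storage level.

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/input.txt")  // illustrative path

// cache() is equivalent to persist(StorageLevel.MEMORY_ONLY)
val inMemory = lines.map(_.toUpperCase).cache()

// persist() takes a user-chosen level, e.g. spill to disk when memory is full
val spillable = lines.map(_.toLowerCase).persist(StorageLevel.MEMORY_AND_DISK_SER)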


2 Answers

Looks like this may be a known issue. See an older JIRA ticket, https://issues.apache.org/jira/browse/SPARK-8582

answered Oct 20 '22 by Glenn Strycker


This is an old question, but it affected me as well, so I did some digging. I found a lot of very unhelpful search results in the change-tracking history on JIRA and GitHub -- mostly developer discussion about proposed code changes. It didn't end up being very informative for me, and I would suggest limiting the amount of time you spend looking through it.

The clearest information I could find on the matter is here: https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md

An RDD which needs to be checkpointed will be computed twice; thus it is suggested to do a rdd.cache() before rdd.checkpoint()

Given that the OP actually did use persist and checkpoint, he was probably on the right track. I suspect the only problem was in the way he invoked checkpoint. I'm fairly new to Spark, but I think he should have done it like so:

newRDD = newRDD.checkpoint

Hope this is clear. Based on my testing, this eliminated the redundant recomputation of one of my DataFrames.
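
One caveat, based on my reading of the API rather than the original post: checkpoint() on a Dataset/DataFrame returns a new, checkpointed Dataset, so the reassignment works there, whereas RDD.checkpoint() returns Unit and only marks the RDD, so the RDD snippet in the question cannot be reassigned the same way. A minimal sketch of the DataFrame pattern (the input path and checkpoint directory are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // illustrative

var df = spark.read.parquet("hdfs:///data/events.parquet")  // illustrative input
  .groupBy("key")
  .count()

// Dataset.checkpoint() is eager by default: it materializes the plan once,
// writes the result to the checkpoint directory, and returns a Dataset
// backed by those files with its lineage truncated.
df = df.checkpoint()

println(df.count())  // served from the checkpointed data, no recomputation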

answered Oct 20 '22 by David Beavon