I am working on a Spark ML pipeline where we get OOM errors on larger data sets. Before training we were using cache(); I swapped this out for checkpoint() and our memory requirements went down significantly. However, the docs for RDD's checkpoint() say:

"It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation."

The same guidance is not given for Dataset's checkpoint, which is what I am using. Following the above advice anyway, I found that the memory requirements actually increased slightly compared to using cache() alone.
My expectation was that when we do

...
ds.cache()
ds.checkpoint()
...

the call to checkpoint forces evaluation of the Dataset, which is cached at the same time before being checkpointed. Afterwards, any reference to ds would use the cached partitions, and if more memory is required and those partitions are evicted, the checkpointed partitions would be used rather than re-evaluating them.

Is this true, or does something different happen under the hood? Ideally I'd like to keep the Dataset in memory if possible, but it seems there is no benefit whatsoever, from a memory standpoint, to the cache-and-checkpoint approach.
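For context, the surrounding setup looks roughly like this (the path and assembleFeatures/rawData are illustrative stand-ins for our actual pipeline):

spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  // illustrative path; required for reliable checkpoints
val ds = assembleFeatures(rawData)                              // hypothetical helper: the expensive feature-engineering chain
ds.cache()
ds.checkpoint()
// training proceeds with ds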
TL;DR You won't benefit from in-memory cache (the default storage level for Dataset is MEMORY_AND_DISK anyway) in subsequent actions, but you should still consider caching if computing ds is expensive.
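In other words (a sketch; ds is the Dataset from the question):

import org.apache.spark.storage.StorageLevel

ds.persist(StorageLevel.MEMORY_AND_DISK)  // what ds.cache() already does for a Dataset: partitions spill to local disk when memory runs out
// ds.persist(StorageLevel.MEMORY_ONLY)   // a purely in-memory level would have to be requested explicitly instead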
Explanation
Your expectation that

ds.cache()
ds.checkpoint()
...

the call to checkpoint forces evaluation of the Dataset

is correct. Dataset.checkpoint comes in different flavors, which allow for both eager and lazy checkpointing, and the default variant is eager:

def checkpoint(): Dataset[T] = checkpoint(eager = true, reliableCheckpoint = true)

Therefore subsequent actions should reuse the checkpoint files.
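For completeness, the public variants look roughly like this (eager is the default; localCheckpoint exists in newer Spark releases):

val reliable     = ds.checkpoint()               // eager: runs a job right away and writes the checkpoint files
val reliableLazy = ds.checkpoint(eager = false)  // lazy: files are written when the returned Dataset is first used in an action
val local        = ds.localCheckpoint()          // newer releases: keeps blocks on the executors instead of reliable storage

Note that checkpoint returns a new Dataset; only actions on the returned Dataset read from the checkpoint files, so the result should be assigned and used in place of the original ds.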
However, under the covers Spark simply applies checkpoint to the internal RDD, so the rules of evaluation haven't changed: Spark evaluates the action first, and then creates the checkpoint (that's why caching was recommended in the first place).
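For reference, this is the plain-RDD pattern the quoted recommendation is about (a sketch; rdd stands for any expensive RDD, and a checkpoint directory is assumed to be set with sc.setCheckpointDir):

rdd.cache()       // keep the computed partitions around ...
rdd.checkpoint()  // ... because this only marks the RDD for checkpointing
rdd.count()       // the action computes rdd; a follow-up job then reads the (cached) partitions and writes the checkpoint files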
So if you omit ds.cache(), ds will be evaluated twice in ds.checkpoint() (a rough way to observe this is sketched below):

- once for the internal count,
- once for the actual checkpoint.
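A rough way to see the double evaluation (assuming a SparkSession named spark and a checkpoint directory already configured; accumulator values from transformations are approximate, but that is exactly what shows the extra pass):

import spark.implicits._                          // for the Encoder used by map

val evals = spark.sparkContext.longAccumulator("evaluations")
val tracked = spark.range(0L, 1000L).map { n => evals.add(1); n.longValue }  // bumps the accumulator once per row per evaluation
tracked.checkpoint()   // eager checkpoint with no prior cache()
println(evals.value)   // roughly 2000: rows are computed once for the internal count and again when the checkpoint files are written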
Therefore nothing changed and cache is still recommended, although the recommendation might be slightly weaker than for a plain RDD, as Dataset cache is considered computationally expensive and, depending on the context, it might be cheaper to simply reload the data (note that Dataset.count without cache is normally optimized, while Dataset.count with cache is not - see "Any performance issues forcing eager evaluation using count in spark?").
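A hedged illustration of that last point:

ds.count()   // without caching, Spark can often answer this with an aggressively pruned/aggregated plan
ds.cache()
ds.count()   // with cache() in effect, the count also has to materialize every column into the in-memory cache first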