Recently I saw some strange behaviour of Spark.
I have a pipeline in my application in which I'm manipulating one big Dataset - pseudocode:
val data = spark.read(...)
  .join(df1, "key")                 // etc., more transformations
data.cache()                        // used to not recalculate data after the save
data.write.parquet(...)             // some save

val extension = data.join(...)      // more transformations - joins, selects, etc.
extension.cache()                   // again, cache to avoid double calculations
extension.count()
// (1)
extension.write.csv(...)            // some other save
extension.groupBy("key").agg(...)   // some aggregations
  .write.parquet(...)               // other save; without the cache it would trigger recomputation of the whole dataset
However, when I call data.unpersist(), i.e. at place (1), Spark removes all Datasets from storage, including the extension Dataset, which is not the Dataset I tried to unpersist.
Is that expected behaviour? How can I free some memory by unpersisting an old Dataset without also unpersisting every Dataset that was "next in chain"?
My setup:
The question looks similar to Understanding Spark's caching, but here I'm doing some actions before the unpersist: first I count everything and then save it into storage. I don't know whether caching works the same for RDDs as for Datasets.
This is expected behaviour of Spark caching. Spark doesn't want to keep invalid cached data, so it completely removes all cached plans that refer to the dataset. This is to make sure the queries stay correct. In the example you are creating the extension dataset from the cached dataset data; once data is unpersisted, extension can essentially no longer rely on data's cache, so its cached plan is removed as well.
Here is the pull request for the fix they made, and you can see the similar JIRA ticket.
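For illustration, here is a minimal sketch of how the cascade can be observed through Dataset.storageLevel (assuming a local Spark 2.3 session; the tiny range Dataset and column names are just placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("unpersist-cascade-demo").getOrCreate()
import spark.implicits._

val data = spark.range(0, 10).toDF("key").cache()              // parent Dataset, cached
val extension = data.withColumn("doubled", $"key" * 2).cache() // derived Dataset, also cached
extension.count()                                              // materialise both caches

println(extension.storageLevel)   // should report a memory/disk storage level
data.unpersist()                  // on Spark 2.3 this cascades to every cached plan built on data...
println(extension.storageLevel)   // ...so here it should report StorageLevel.NONE; on 2.4+ extension stays cached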
Answer for Spark 2.4:
There was a ticket about correctness in Datasets and caching behaviour, see https://issues.apache.org/jira/browse/SPARK-24596
From Maryann Xue's description in that ticket, caching now works in the following manner:

- drop tables and regular (persistent) views: regular mode
- drop temporary views: non-cascading mode
- modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode
- call Dataset.unpersist(): non-cascading mode
- call Catalog.uncacheTable(): same convention as dropping tables/views, i.e. non-cascading mode for temporary views and regular mode for the rest

Here "regular mode" is the cascading behaviour from the question and @Avishek's answer, and "non-cascading mode" means that extension won't be unpersisted when data is.
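So on Spark 2.4 and later the pipeline from the question can release data's cache at point (1) without losing extension's cache. A rough sketch under that assumption (the paths, df1/df2, the "value" column and the aggregation are placeholders standing in for the question's real transformations):

import org.apache.spark.sql.functions.sum

val data = spark.read.parquet("/path/to/input").join(df1, "key")
data.cache()                          // avoid recomputing data for the first save
data.write.parquet("/path/to/save1")

val extension = data.join(df2, "key") // more joins/selects as in the question
extension.cache()
extension.count()                     // materialise extension's cache

data.unpersist()                      // (1) non-cascading since Spark 2.4: extension stays cached

extension.write.csv("/path/to/save2")
extension.groupBy("key").agg(sum("value").as("total"))
  .write.parquet("/path/to/save3")    // served from extension's cache, no full recomputation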