I'm looking for a way to checkpoint DataFrames. Checkpoint is currently an operation on RDDs, but I can't find how to do it with DataFrames. persist and cache (which are synonyms for each other) are available for DataFrames, but they do not "break the lineage" and are thus unsuitable for methods that could loop for hundreds (or thousands) of iterations.
As an example, suppose that I have a list myfunctions of functions whose signature is DataFrame => DataFrame. I want a way to compute the following even when myfunctions has hundreds or thousands of entries:
def foo(dataset: DataFrame, g: DataFrame => Unit) =
  myfunctions.foldLeft(dataset) { case (df, f) =>
    val nextDF = f(df)
    g(nextDF)
    nextDF
  }
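For concreteness, here is a minimal sketch of how this might be set up and invoked; the entries in myfunctions and the initialDF value are hypothetical placeholders:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical stand-ins: in practice this list could have thousands of entries
val myfunctions: Seq[DataFrame => DataFrame] = Seq(
  _.withColumn("a", lit(1)),
  _.filter(col("a") >= 0),
  _.withColumn("b", col("a") * 2)
)

val result = foo(initialDF, df => println(df.count())) // initialDF: some starting DataFrame

Without lineage truncation, each iteration extends the logical plan of the previous one, which is exactly what makes long loops problematic.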
Checkpointing can be used to truncate the logical plan of this DataFrame, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext.setCheckpointDir(). New in version 2.1.
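So on Spark 2.1+ you can call checkpoint on the DataFrame itself; a minimal sketch (the directory path is illustrative):

// Spark 2.1+: checkpoint is available directly on Dataset/DataFrame
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // illustrative path

val checkpointedDF = df.checkpoint()  // eager by default: materializes immediately
// df.checkpoint(eager = false)       // lazy variant: materialized by the next action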
Azure Databricks uses the checkpoint directory to ensure correct and consistent progress information. When a stream is shut down, either purposely or accidentally, the checkpoint directory allows Azure Databricks to restart and pick up exactly where it left off.
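For Structured Streaming, the directory is supplied per query via the checkpointLocation option; a minimal sketch (the paths and streamingDF are illustrative):

// Progress and state are tracked under the checkpointLocation
val query = streamingDF.writeStream
  .format("parquet")
  .option("path", "/output/dir")                   // illustrative
  .option("checkpointLocation", "/checkpoint/dir") // illustrative
  .start()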
There are two types of Apache Spark checkpointing: reliable and local. Reliable checkpointing saves the actual RDD to a reliable distributed file system, e.g. HDFS. To set the checkpoint directory, call SparkContext.setCheckpointDir(directory: String).
Use SparkContext.setCheckpointDir to set the path to the checkpoint directory. Checkpointing can be local or reliable, which defines how reliable the checkpoint directory is. Local checkpointing writes checkpoint files to executor storage and, due to the executor lifecycle, is considered unreliable.
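A minimal sketch contrasting the two flavors (the directory path is illustrative):

sc.setCheckpointDir("hdfs:///checkpoints") // required for reliable checkpointing

val rdd1 = sc.parallelize(1 to 100)
rdd1.checkpoint()      // reliable: written to the checkpoint directory
rdd1.count()           // an action materializes the checkpoint

val rdd2 = sc.parallelize(1 to 100)
rdd2.localCheckpoint() // local: kept in executor storage; faster but unreliable
rdd2.count()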
TL;DR: For Spark versions up to 1.6, to actually get a "checkpointed DF", my suggested solution is based on another answer, but with one extra line:
df.rdd.checkpoint
df.rdd.count
val df2 = sqlContext.createDataFrame(df.rdd, df.schema) // df2 is checkpointed
Explanation
Updated after further research.
As pointed out, checkpointing a DataFrame directly is not currently (Spark 1.6.1) possible, though there is an issue for it on Spark's Jira.
So, a possible workaround is the one suggested on another answer:
df.rdd.checkpoint // Assuming the checkpoint dir has already been set
df.count          // An action to compute the checkpoint
However, with this approach, only the df.rdd object will be checkpointed. This can be verified by calling toDebugString on df.rdd:
scala> df.rdd.toDebugString
(32) MapPartitionsRDD[1] at rdd at <console>:38 []
 |  ReliableCheckpointRDD[2] at count at <console>:38 []
Then, calling toDebugString after a quick transformation to df (please note that I created my DataFrame from a JDBC source) returns the following:
scala> df.withColumn("new_column", lit(0)).rdd.toDebugString
res4: String =
(32) MapPartitionsRDD[5] at rdd at <console>:38 []
 |  MapPartitionsRDD[4] at rdd at <console>:38 []
 |  JDBCRDD[3] at rdd at <console>:38 []
df.explain also shows a hint:

scala> df.explain
== Physical Plan ==
Scan JDBCRelation (...)
So, to actually achieve a "checkpointed" DataFrame, I can only think of creating a new one from the checkpointed RDD:
val newDF = sqlContext.createDataFrame(df.rdd, df.schema)
// or
val newDF = df.rdd.map {
  case Row(val1: Int, ..., valN: Int) => (val1, ..., valN)
}.toDF("col1", ..., "colN")
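Of the two, createDataFrame(df.rdd, df.schema) is usually preferable, since it preserves the original schema without having to enumerate the columns and their types by hand.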
Then we can verify that the new DataFrame is "checkpointed":
1) newDF.explain:

scala> newDF.explain
== Physical Plan ==
Scan PhysicalRDD[col1#5, col2#6, col3#7]
2) newDF.rdd.toDebugString:

scala> newDF.rdd.toDebugString
res7: String =
(32) MapPartitionsRDD[10] at rdd at <console>:40 []
 |  MapPartitionsRDD[8] at createDataFrame at <console>:37 []
 |  MapPartitionsRDD[1] at rdd at <console>:38 []
 |  ReliableCheckpointRDD[2] at count at <console>:38 []
3) With transformation:
scala> newDF.withColumn("new_column", lit(0)).rdd.toDebugString
res9: String =
(32) MapPartitionsRDD[12] at rdd at <console>:40 []
 |  MapPartitionsRDD[11] at rdd at <console>:40 []
 |  MapPartitionsRDD[8] at createDataFrame at <console>:37 []
 |  MapPartitionsRDD[1] at rdd at <console>:38 []
 |  ReliableCheckpointRDD[2] at count at <console>:38 []
Also, I tried some more complex transformations and was able to verify, in practice, that the newDF object remained checkpointed.
Therefore, the only way I found to reliably checkpoint a DataFrame was by checkpointing its associated RDD and creating a new DataFrame object from it.
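Putting this together with the loop from the question, one could wrap the trick in a helper; a sketch along those lines (the helper name and the checkpoint-every-50 cadence are my own choices):

// Hypothetical helper: truncates the lineage by round-tripping through the RDD
def checkpointDF(df: DataFrame): DataFrame = {
  df.rdd.checkpoint()
  df.rdd.count() // an action to materialize the checkpoint
  sqlContext.createDataFrame(df.rdd, df.schema)
}

// Checkpoint every 50 iterations (cadence is arbitrary) to keep the plan small
def foo(dataset: DataFrame, g: DataFrame => Unit) =
  myfunctions.zipWithIndex.foldLeft(dataset) { case (df, (f, i)) =>
    val nextDF = if ((i + 1) % 50 == 0) checkpointDF(f(df)) else f(df)
    g(nextDF)
    nextDF
  }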
I hope it helps. Cheers.