
Spark: efficiency of dataframe checkpoint vs. explicitly writing to disk

Checkpoint version:

val savePath = "/some/path"
spark.sparkContext.setCheckpointDir(savePath)
val df2 = df.checkpoint() // checkpoint returns a new DataFrame; use df2 downstream

Write to disk version:

df.write.parquet(savePath)
val df2 = spark.read.parquet(savePath) // re-read to get a DataFrame with a fresh lineage

I think both break the lineage in the same way.

In my experiments checkpoint is almost 30 times bigger on disk than parquet (689GB vs. 24GB). In terms of running time, checkpoint takes about 1.5 times longer (10.5 min vs. 7.5 min).

Considering all this, what would be the point of using checkpoint instead of saving to file? Am I missing something?

asked Aug 09 '18 by germanium



1 Answer

Checkpointing is the process of truncating an RDD's lineage graph and saving it to a reliable distributed (e.g. HDFS) or local file system. If you have a large RDD lineage graph and you want to freeze the content of the current RDD, i.e. materialize the complete RDD before proceeding to the next step, you generally use persist or checkpoint. The checkpointed RDD can then be reused for some other purpose.
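For reference, a minimal sketch of the two options mentioned above (persist vs. checkpoint); df and the paths are placeholders:

import org.apache.spark.storage.StorageLevel

// persist keeps the data around (here on executor disk) but the lineage is retained,
// so Spark can still recompute lost partitions from the original plan
val persisted = df.persist(StorageLevel.DISK_ONLY)

// checkpoint materializes the data to the checkpoint directory and truncates the lineage
spark.sparkContext.setCheckpointDir("/some/path")
val checkpointed = df.checkpoint() // eager by default in Spark 2.1+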

When you checkpoint, the RDD is serialized and stored on disk. It is not stored in parquet format, so the data is not optimally laid out on disk, contrary to parquet, which provides various compression and encoding schemes to store the data efficiently. This would explain the difference in size.
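If you want to verify the size difference yourself, a rough sketch using the Hadoop FileSystem API (the paths here are hypothetical) could look like this:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
def sizeOnDisk(dir: String): Long = fs.getContentSummary(new Path(dir)).getLength

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
df.checkpoint()                        // serialized RDD files, no columnar encoding
df.write.parquet("/tmp/df_parquet")    // columnar, encoded and compressed

println(s"checkpoint: ${sizeOnDisk("/tmp/checkpoints")} bytes")
println(s"parquet:    ${sizeOnDisk("/tmp/df_parquet")} bytes")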

  • You should definitely think about checkpointing in a noisy cluster. A cluster is called noisy if there are lots of jobs and users competing for resources, and there are not enough resources to run all of them simultaneously.

  • You must think about checkpointing if your computations are really expensive and take a long time to finish, because it could be faster to write an RDD to HDFS and read it back in parallel than to recompute it from scratch (see the sketch after this list).
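As a sketch of that second point, an iterative job might checkpoint every few iterations so the lineage (and the cost of a recomputation) does not keep growing; step here is a hypothetical transformation:

spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

var current = df
for (i <- 1 to 100) {
  current = step(current)             // some expensive transformation
  if (i % 10 == 0) {
    current = current.checkpoint()    // materialize and cut the lineage here
  }
}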

There was also a slight inconvenience prior to the Spark 2.1 release: there was no way to checkpoint a DataFrame, so you had to checkpoint the underlying RDD. This issue was resolved in Spark 2.1 and later versions.
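For illustration, the pre-2.1 workaround versus the current API might look like this (a sketch, not production code):

spark.sparkContext.setCheckpointDir("/some/path")

// Spark < 2.1: checkpoint the underlying RDD and rebuild the DataFrame
val rdd = df.rdd
rdd.checkpoint()
rdd.count()                                          // force materialization
val dfFromRdd = spark.createDataFrame(rdd, df.schema)

// Spark >= 2.1: checkpoint the DataFrame directly (eager by default)
val dfCheckpointed = df.checkpoint()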

The problems with saving to disk as parquet and reading it back are:

  • It can be inconvenient in coding: you have to save and read it explicitly every time (a small helper, sketched after this list, can reduce this).
  • It can slow down the overall performance of the job, because when you save as parquet and read it back the DataFrame needs to be reconstructed again.
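The helper could be as simple as the following; saveAndReload is a hypothetical name, not a Spark API:

import org.apache.spark.sql.DataFrame

def saveAndReload(df: DataFrame, path: String): DataFrame = {
  df.write.mode("overwrite").parquet(path)   // write the current state out
  df.sparkSession.read.parquet(path)         // read it back with a fresh, truncated lineage
}

val df2 = saveAndReload(df, "/some/path")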

This wiki could be useful for further investigation

As presented in the Dataset checkpointing wiki:

Checkpointing is actually a feature of Spark Core (that Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with previously computed state of a distributed computation described as an RDD. That has been successfully used in Spark Streaming - the now-obsolete Spark module for stream processing based on RDD API.

Checkpointing truncates the lineage of a RDD to be checkpointed. That has been successfully used in Spark MLlib in iterative machine learning algorithms like ALS.

Dataset checkpointing in Spark SQL uses checkpointing to truncate the lineage of the underlying RDD of a Dataset being checkpointed.
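If you want to see this truncation yourself, comparing the RDD debug string before and after checkpointing shows the lineage collapsing (a sketch; "someColumn" is a hypothetical column name):

val before = df.groupBy("someColumn").count()
println(before.rdd.toDebugString)   // long lineage reflecting the full computation

val after = before.checkpoint()
println(after.rdd.toDebugString)    // short lineage rooted at the checkpointed data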

answered Oct 18 '22 by Avishek Bhattacharya