Spark dataframe checkpoint cleanup

I have a DataFrame in Spark into which an entire Hive partition has been loaded, and I need to break the lineage so that I can overwrite the same partition after making some modifications to the data. However, when the Spark job is done, I am left with the checkpoint data on HDFS. Why does Spark not clean this up by itself, or is there something I am missing?

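import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

spark.sparkContext.setCheckpointDir("/home/user/checkpoint/")
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Column has no equal() method; use === (or equalTo) for the comparison
val df = spark.table("db.my_table").filter(col("partition") === 2)

// ... transformations to the dataframe

// checkpoint() materializes the data and truncates the lineage
val checkpointDf = df.checkpoint()
checkpointDf.write.format("parquet").mode(SaveMode.Overwrite).insertInto("db.my_table")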

After this I have this file on HDFS:

/home/user/checkpoint/214797f2-ce2e-4962-973d-8f215e5d5dd8/rdd-23/part-00000

Each time I run the Spark job, I get a new directory with a new unique id, containing files for each RDD that was part of the DataFrames.

asked Jan 31 '20 19:01

aweis

1 Answer

Spark has a built-in mechanism for cleaning up checkpoint files, but it is disabled by default.

To enable it, add this property to spark-defaults.conf:

spark.cleaner.referenceTracking.cleanCheckpoints  true #Default is false

You can find more about Spark configuration on the official Spark configuration page.

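If you want to remove the checkpoint directory yourself, you can delete it at the end of your script. In Python you could use shutil.rmtree for this, but note that rmtree only works on paths visible to the local filesystem; for a directory that lives on HDFS you need something that talks to HDFS, such as the Hadoop FileSystem API or hdfs dfs -rm -r.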
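Since the question uses Scala, here is a minimal sketch of the manual deletion using the Hadoop FileSystem API. The object and method names (CheckpointCleanup, deleteCheckpointDir) are made up for illustration and are not part of Spark. It assumes you call it only after the write has finished, because the checkpointed DataFrame still reads from these files until then.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of Spark.
object CheckpointCleanup {
  // Deletes the checkpoint directory of the current application, if one was set.
  def deleteCheckpointDir(spark: SparkSession): Unit = {
    val sc = spark.sparkContext
    // getCheckpointDir returns the per-application subdirectory (the one with
    // the unique id), or None if setCheckpointDir was never called.
    sc.getCheckpointDir.foreach { dir =>
      val path = new Path(dir)
      // Resolve the filesystem from the path, so this works for HDFS and local paths.
      val fs = path.getFileSystem(sc.hadoopConfiguration)
      // recursive = true removes the rdd-* directories and their part files
      fs.delete(path, true)
    }
  }
}
```

You would call CheckpointCleanup.deleteCheckpointDir(spark) as the last step of the job, after insertInto has returned. Only the unique-id subdirectory is removed; the parent directory you passed to setCheckpointDir stays in place.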

Setting spark.cleaner.referenceTracking.cleanCheckpoints to true allows Spark's cleaner to remove old checkpoint files inside the checkpoint directory once the reference to the checkpointed data goes out of scope.
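If you cannot edit spark-defaults.conf, the same property can be set when the session is built. This is only a sketch: the object name and app name are placeholders, and it assumes no SparkContext exists yet, since the property has to be in place before the context is created (setting it later with spark.conf.set should have no effect on the cleaner).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical factory object, for illustration only.
object SessionFactory {
  def build(): SparkSession =
    SparkSession.builder()
      .appName("checkpoint-cleanup-example") // placeholder name
      // Must be set before the SparkContext is created.
      .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
      .getOrCreate()
}
```

Passing --conf spark.cleaner.referenceTracking.cleanCheckpoints=true to spark-submit should be equivalent. As far as I understand, the cleanup is tied to garbage collection on the driver, so files can still be left behind if the application exits before the checkpointed data has been collected. In that case deleting the directory explicitly at the end of the job is the more predictable option.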

answered Oct 29 '22 05:10

ggeop

Tags: scala, apache-spark, hive
