According to the documentation, it is possible to tell Spark to keep track of "out of scope" checkpoints - those that are no longer needed - and clean them from disk:
SparkSession.builder
...
.config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
.getOrCreate()
Apparently it does so, but the problem is that the last checkpointed RDDs are never deleted.
Is there any configuration I am missing to perform all the cleanup? If there isn't: is there any way to get the name of the temporary folder created for a particular application so I can programmatically delete it? I.e. get 0c514fb8-498c-4455-b147-aff242bd7381 from SparkContext, the same way you can get the applicationId.
There are two types of Apache Spark checkpointing. Reliable checkpointing saves the actual RDD to reliable distributed storage such as HDFS; you need to call SparkContext.setCheckpointDir(directory: String) to set the checkpoint directory. Local checkpointing, by contrast, persists the RDD to executor storage, which is faster but not fault tolerant.
Checkpointing is a feature of Spark Core (which Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with the previously computed state of a distributed computation, described as an RDD.
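For reference, a minimal reliable-checkpointing sketch in PySpark could look like the following; the path and the RDD contents are placeholders, not details from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
sc = spark.sparkContext

# Reliable checkpointing: the RDD is written to the configured directory on HDFS
sc.setCheckpointDir("hdfs:///tmp/checkpoint")  # placeholder path

rdd = sc.parallelize(range(100)).map(lambda x: x * 2)
rdd.checkpoint()  # mark the RDD for checkpointing
rdd.count()       # an action triggers the actual write to the checkpoint directory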
I know it's an old question, but I was recently exploring checkpointing and ran into similar problems, so I would like to share my findings.
Question: Is there any configuration I am missing to perform all the cleanup?
Setting spark.cleaner.referenceTracking.cleanCheckpoints=true works sometimes, but it is hard to rely on it. The official documentation says that setting this property will
"clean checkpoint files if the reference is out of scope".
I don't know exactly what that means, because my understanding is that once the Spark session/context is stopped, it should clean them up.
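To illustrate what "the reference is out of scope" seems to mean in practice, here is a hedged PySpark sketch (paths and names are placeholders): the property is set when the session is built, an RDD is checkpointed, and the driver-side reference is then dropped so the ContextCleaner can remove the checkpoint files once the object is garbage collected. There is no guarantee this happens before the application exits, which matches the unreliable behaviour described above.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cleaner-demo")
         # enable cleanup of out-of-scope checkpoint files
         .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
         .getOrCreate())
sc = spark.sparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpoint")  # placeholder path

rdd = sc.parallelize(range(1000))
rdd.checkpoint()
rdd.count()  # materialize the checkpoint on HDFS

# Drop the only driver-side reference; the ContextCleaner may delete the
# checkpoint files after the object is garbage collected, but it may also
# never get the chance before the application stops.
rdd = None
import gc
gc.collect()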
However, I found an answer to your other question:
If there isn't: Is there any way to get the name of the temporary folder created for a particular application so I can programmatically delete it? I.e. get 0c514fb8-498c-4455-b147-aff242bd7381 from SparkContext the same way you can get the applicationId
Yes, we can get the checkpoint directory as shown below.
Scala:
// Set the checkpoint directory
scala> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")
scala> spark.sparkContext.getCheckpointDir.get
res3: String = hdfs://<name-node:port>/tmp/checkpoint/625034b3-c6f1-4ab2-9524-e48dfde589c3
// It returns a String, so we can use org.apache.hadoop.fs to delete the path
PySpark:
# Set the checkpoint directory
>>> spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoint')
>>> t = sc._jsc.sc().getCheckpointDir().get()
>>> t
u'hdfs://<name-node:port>/tmp/checkpoint/dc99b595-f8fa-4a08-a109-23643e2325ca'
# Notice the 'u' prefix: the call returns a unicode object
# Get a Hadoop FileSystem object and delete the checkpoint path
>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
>>> fs.delete(sc._jvm.org.apache.hadoop.fs.Path(str(t)), True)  # True = recursive delete
True
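Putting this together, one option is to delete the checkpoint directory explicitly at the end of the script, just before stopping the session. The helper below is a hypothetical sketch built from the same py4j calls used above (the name remove_checkpoint_dir is my own):

def remove_checkpoint_dir(spark):
    # Hypothetical helper: delete this application's checkpoint directory, if one was set
    sc = spark.sparkContext
    checkpoint_dir = sc._jsc.sc().getCheckpointDir()  # scala.Option[String]
    if checkpoint_dir.isDefined():
        path = sc._jvm.org.apache.hadoop.fs.Path(checkpoint_dir.get())
        fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
        if fs.exists(path):
            fs.delete(path, True)  # True = recursive delete

# At the end of the script:
remove_checkpoint_dir(spark)
spark.stop()

Note that this relies on private attributes (sc._jsc, sc._jvm), just like the REPL session above, so it may need adjusting between Spark versions.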