Spark Structured Streaming Checkpoint Cleanup

I am ingesting data from a file source using Structured Streaming. I have a checkpoint set up, and it works correctly as far as I can tell, but I don't understand what will happen in a couple of situations. If my streaming app runs for a long time, will the checkpoint files just continue to grow forever, or are they eventually cleaned up? And does it matter if they are never cleaned up? It seems that eventually they would become large enough that the program would take a long time to parse them.

My other question is: when I manually remove or alter the checkpoint folder, or change to a different checkpoint folder, no new files are ingested. The files are recognized and added to the checkpoint, but they are not actually ingested. This has me worried that if the checkpoint folder is somehow altered, my ingestion will break. I haven't been able to find much information on the correct procedure in these situations.
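For reference, a minimal sketch of this kind of setup (schema, paths, and formats are placeholders, not my actual job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder().appName("FileIngest").getOrCreate()

// Placeholder schema for the incoming files
val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

// File source: the files Spark has already seen are tracked
// in the checkpoint's sources/ metadata log
val input = spark.readStream
  .schema(schema)
  .json("/data/incoming")  // placeholder input path

val query = input.writeStream
  .format("parquet")
  .option("path", "/data/output")             // placeholder sink path
  .option("checkpointLocation", "/data/chk")  // offsets, commits, sources log
  .start()

query.awaitTermination()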

asked Jan 13 '18 by torpedoted

People also ask

What is checkpoint in structured Streaming?

A checkpoint helps build fault-tolerant and resilient Spark applications. In Spark Structured Streaming, it maintains intermediate state on HDFS-compatible file systems to recover from failures.

What is Spark Streaming checkpoint?

Checkpointing is the process of writing received records and metadata to HDFS at checkpoint intervals. A streaming application must often operate 24/7, so it must be resilient to failures unrelated to the application logic, such as system failures and JVM crashes.
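For comparison, enabling checkpointing in the older DStream API looks roughly like this (the directory and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("DStreamCheckpoint")
  val ssc = new StreamingContext(conf, Seconds(10))
  // Received records and metadata are written here at checkpoint intervals
  ssc.checkpoint("hdfs:///tmp/dstream-checkpoint")  // placeholder path
  // ... define DStream sources and transformations here ...
  ssc
}

// Recover from the checkpoint if one exists, otherwise build a fresh context
val ssc = StreamingContext.getOrCreate("hdfs:///tmp/dstream-checkpoint", createContext _)
ssc.start()
ssc.awaitTermination()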

What is the difference between Spark Streaming and structured Streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the SparkSQL API for data stream processing. In the end, all the APIs are optimized by the Spark Catalyst optimizer and translated into RDDs for execution under the hood.

How do I optimize a Spark for Streaming?

Save the old compact file as a backup and for input/output file analysis. Move the new, shredded compact file into the Spark checkpoint directory and commit directory, keeping the same file structure and file name. Make sure the Spark streaming job is not running while you are updating the metadata compact file.
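A hedged sketch of that swap using plain JVM file operations; every path below is a placeholder, and the streaming job must be stopped before running it:

import java.nio.file.{Files, Paths, StandardCopyOption}

// Placeholder locations; the compact file lives under the checkpoint's
// sources/0/ metadata log (the commits log has its own counterpart)
val current  = Paths.get("/data/chk/sources/0/9.compact")
val backup   = Paths.get("/backup/9.compact")
val shredded = Paths.get("/staging/9.compact")

// 1. Keep the old compact file for backup and input/output analysis
Files.copy(current, backup, StandardCopyOption.REPLACE_EXISTING)

// 2. Swap in the shredded file, keeping the same file name and location
Files.move(shredded, current, StandardCopyOption.REPLACE_EXISTING)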


1 Answer

If my streaming app runs for a long time, will the checkpoint files just continue to grow forever, or are they eventually cleaned up?

Structured Streaming keeps a background thread that is responsible for deleting snapshots and deltas of your state, so you shouldn't be concerned about it unless your state is really large and the amount of space you have is small, in which case you can configure how many deltas/snapshots Spark retains.
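The relevant knobs are SQL configs; for example (names from recent Spark releases, worth verifying against your version):

// Minimum number of batches whose offset/commit metadata is kept around
spark.conf.set("spark.sql.streaming.minBatchesToRetain", "100")

// How many state delta files accumulate before a snapshot is taken
spark.conf.set("spark.sql.streaming.stateStore.minDeltasForSnapshot", "10")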

when I manually remove or alter the checkpoint folder, or change to a different checkpoint folder, no new files are ingested.

I'm not really sure what you mean here, but you should only remove checkpointed data in special cases. Structured Streaming allows you to keep state between version upgrades as long as the stored data types are backwards compatible. I don't see a good reason to alter the checkpoint location or delete the files manually unless something bad has happened.

answered Sep 30 '22 by Yuval Itzchakov