Spark Structured Streaming Checkpoint Cleanup

I am ingesting data from a file source using Structured Streaming. I have a checkpoint set up, and it works correctly as far as I can tell, but I don't understand what will happen in a couple of situations. If my streaming app runs for a long time, will the checkpoint files just continue to grow forever, or are they eventually cleaned up? And does it matter if they are never cleaned up? It seems that eventually they would become large enough that the program would take a long time to parse them.

My other question is: when I manually remove or alter the checkpoint folder, or change to a different checkpoint folder, no new files are ingested. The files are recognized and added to the checkpoint, but they are not actually ingested. This has me worried that if the checkpoint folder is somehow altered, my ingestion will break. I haven't been able to find much information on the correct procedure in these situations.
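For reference, a minimal sketch of this kind of setup (schema, paths, and formats are placeholders, not my actual job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder().appName("FileIngest").getOrCreate()

// Placeholder schema for the incoming files
val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

// File source: the files Spark has already seen are tracked
// in the checkpoint's sources/ metadata log
val input = spark.readStream
  .schema(schema)
  .json("/data/incoming")  // placeholder input path

val query = input.writeStream
  .format("parquet")
  .option("path", "/data/output")             // placeholder sink path
  .option("checkpointLocation", "/data/chk")  // offsets, commits, sources log
  .start()

query.awaitTermination()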

asked Jan 13 '18 by torpedoted

People also ask

What is checkpoint in structured Streaming?

A checkpoint helps build fault-tolerant and resilient Spark applications. In Spark Structured Streaming, it maintains intermediate state on HDFS-compatible file systems to recover from failures.

What is Spark Streaming checkpoint?

Checkpointing is the process of writing received records and metadata to HDFS at checkpoint intervals. A streaming application must often operate 24/7, so it must be resilient to failures unrelated to the application logic, such as system failures and JVM crashes.
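For comparison, enabling checkpointing in the older DStream API looks roughly like this (the directory and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("DStreamCheckpoint")
  val ssc = new StreamingContext(conf, Seconds(10))
  // Received records and metadata are written here at checkpoint intervals
  ssc.checkpoint("hdfs:///tmp/dstream-checkpoint")  // placeholder path
  // ... define DStream sources and transformations here ...
  ssc
}

// Recover from the checkpoint if one exists, otherwise build a fresh context
val ssc = StreamingContext.getOrCreate("hdfs:///tmp/dstream-checkpoint", createContext _)
ssc.start()
ssc.awaitTermination()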

What is the difference between Spark Streaming and structured Streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the SparkSQL API for data stream processing. In the end, all the APIs are optimized by the Spark Catalyst optimizer and translated into RDDs for execution under the hood.

How do I optimize a Spark for Streaming?

Save the old compact file as a backup and for input/output file analysis. Move the new, shredded compact file into the Spark checkpoint directory and commit directory, keeping the same file structure and file name. Make sure the Spark streaming job is not running while you are updating the metadata compact file.
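A hedged sketch of that swap using plain JVM file operations; every path below is a placeholder, and the streaming job must be stopped before running it:

import java.nio.file.{Files, Paths, StandardCopyOption}

// Placeholder locations; the compact file lives under the checkpoint's
// sources/0/ metadata log (the commits log has its own counterpart)
val current  = Paths.get("/data/chk/sources/0/9.compact")
val backup   = Paths.get("/backup/9.compact")
val shredded = Paths.get("/staging/9.compact")

// 1. Keep the old compact file for backup and input/output analysis
Files.copy(current, backup, StandardCopyOption.REPLACE_EXISTING)

// 2. Swap in the shredded file, keeping the same file name and location
Files.move(shredded, current, StandardCopyOption.REPLACE_EXISTING)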


1 Answer

If my streaming app runs for a long time, will the checkpoint files just continue to grow forever, or are they eventually cleaned up?

Structured Streaming keeps a background thread that is responsible for deleting snapshots and deltas of your state, so you shouldn't be concerned about it unless your state is really large and the amount of space you have is small, in which case you can configure how many deltas/snapshots Spark retains.
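The relevant knobs are SQL configs; for example (names from recent Spark releases, worth verifying against your version):

// Minimum number of batches whose offset/commit metadata is kept around
spark.conf.set("spark.sql.streaming.minBatchesToRetain", "100")

// How many state delta files accumulate before a snapshot is taken
spark.conf.set("spark.sql.streaming.stateStore.minDeltasForSnapshot", "10")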

when I manually remove or alter the checkpoint folder, or change to a different checkpoint folder, no new files are ingested.

I'm not really sure what you mean here, but you should only remove checkpointed data in special cases. Structured Streaming allows you to keep state between version upgrades as long as the stored data types are backwards compatible. I don't see a good reason to alter the checkpoint location or delete the files manually unless something bad has happened.

answered Sep 30 '22 by Yuval Itzchakov