Can someone please help me understand the difference between Apache Flink's Checkpoints & Savepoints. While i read the documentation, couldn't understand the difference! :s

Savepoints Savepoints usually apply to an individual transaction; it marks a point to which the transaction can be rolled back, so subsequent changes can be undone if necessary. More See Here https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cli.html#savepoints Checkpoints Checkpoints usually apply to whole systems, You can configure periodic checkpoints to be persisted externally. Externalized checkpoints write their meta data out to persistent storage and are not automatically cleaned up when the job fails. More See Here: https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/checkpoints.html

Apache Flink - Difference between Checkpoints & Save points?

2 Answers

Apache Flink's Checkpoints and Savepoints are similar in that way they both are mechanisms for preserving internal state of Flink's applications.

Checkpoints are taken automatically and are used for automatic restarting job in case of a failure.

Savepoints on the other hand are taken manually, are always stored externally and are used for starting a "new" job with previous internal state in case of e.g.

bug fixing
flink version upgrade
A/B testing, etc.

Underneath they are in fact the same mechanism/code path with some subtle nuances.

Edit:

You can also find a very good explanation in the official documentation https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#what-is-a-savepoint-how-is-a-savepoint-different-from-a-checkpoint :

A Savepoint is a consistent image of the execution state of a streaming job, created via Flink’s checkpointing mechanism. You can use Savepoints to stop-and-resume, fork, or update your Flink jobs. Savepoints consist of two parts: a directory with (typically large) binary files on stable storage (e.g. HDFS, S3, …) and a (relatively small) meta data file. The files on stable storage represent the net data of the job’s execution state image. The meta data file of a Savepoint contains (primarily) pointers to all files on stable storage that are part of the Savepoint, in form of absolute paths. Attention: In order to allow upgrades between programs and Flink versions, it is important to check out the following section about assigning IDs to your operators.

Conceptually, Flink’s Savepoints are different from Checkpoints in a similar way that backups are different from recovery logs in traditional database systems. The primary purpose of Checkpoints is to provide a recovery mechanism in case of unexpected job failures. A Checkpoint’s lifecycle is managed by Flink, i.e. a Checkpoint is created, owned, and released by Flink - without user interaction. As a method of recovery and being periodically triggered, two main design goals for the Checkpoint implementation are i) being as lightweight to create and ii) being as fast to restore from as possible. Optimizations towards those goals can exploit certain properties, e.g. that the job code doesn’t change between the execution attempts. Checkpoints are usually dropped after the job was terminated by the user (except if explicitly configured as retained Checkpoints).

In contrast to all this, Savepoints are created, owned, and deleted by the user. Their use-case is for planned, manual backup and resume. For example, this could be an update of your Flink version, changing your job graph, changing parallelism, forking a second job like for a red/blue deployment, and so on. Of course, Savepoints must survive job termination. Conceptually, Savepoints can be a bit more expensive to produce and restore and focus more on portability and support for the previously mentioned changes to the job.

Those conceptual differences aside, the current implementations of Checkpoints and Savepoints are basically using the same code and produce the same format. However, there is currently one exception from this, and we might introduce more differences in the future. The exception are incremental checkpoints with the RocksDB state backend. They are using some RocksDB internal format instead of Flink’s native savepoint format. This makes them the first instance of a more lightweight checkpointing mechanism, compared to Savepoints.

102

answered Oct 23 '22 13:10

Dawid Wysakowicz

Savepoints

Savepoints usually apply to an individual transaction; it marks a point to which the transaction can be rolled back, so subsequent changes can be undone if necessary.

More See Here

https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cli.html#savepoints

Checkpoints

Checkpoints usually apply to whole systems, You can configure periodic checkpoints to be persisted externally. Externalized checkpoints write their meta data out to persistent storage and are not automatically cleaned up when the job fails. More See Here:

https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/checkpoints.html

answered Oct 23 '22 13:10

R Palanivel-Tamilnadu India

Related questions
                            
                                Task distribution in Apache Flink
                            
                                Flink: Sharing state in CoFlatMapFunction
                            
                                Flink cluster configuration issue - no slots available
                            
                                Apache Spark Structured Streaming vs Apache Flink: what is the difference?
                            
                                Run Apache Flink with Amazon S3
                            
                                How to count unique words in a stream?
                            
                                How to Reference the External Jar in Flink
                            
                                Flink Scala API "not enough arguments"
                            
                                flink - using dagger injections - not serializable?
                            
                                Apache Beam Counter/Metrics not available in Flink WebUI
                            
                                How can I start the Flink job manager web interface when running Flink from an IDE
                            
                                How does Apache Flink compare to Mapreduce on Hadoop?
                            
                                Apache Flink: java.lang.NoClassDefFoundError
                            
                                How to combine streaming data with large history data set in Dataflow/Beam
                            
                                How to look up and update the state of a record from a database in Apache Flink?
                            
                                Flink webui when running from IDE
                            
                                Kafka Client Timeout of 60000ms expired before the position for partition could be determined
                            
                                Spark vs Flink low memory available
                            
                                Could not resolve substitution to a value: ${akka.stream.materializer} in AWS Lambda
                            
                                Some puzzles for the operator Parallelism in Flink

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Apache Flink - Difference between Checkpoints & Save points?

Tags:

apache-flink

Raja

People also ask

2 Answers

Dawid Wysakowicz

R Palanivel-Tamilnadu India

Recent Activity

Donate For Us