
Flink exactly-once message processing

I've set up a Flink 1.2 standalone cluster with 2 JobManagers and 3 TaskManagers, and I'm using JMeter to load-test it by producing Kafka messages/events which are then processed. The processing job runs on a TaskManager and usually handles ~15K events/s.
The job has EXACTLY_ONCE checkpointing enabled and persists state and checkpoints to Amazon S3. If I shut down the TaskManager running the job, it takes a few seconds and then the job resumes on a different TaskManager. The job mainly logs the event ids, which are consecutive integers (e.g. from 0 to 1200000).
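For reference, the checkpointing setup looks roughly like this (the interval and bucket name below are placeholders, not the exact values used):

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Take a checkpoint every 10 seconds with exactly-once guarantees
    env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

    // Keep state and checkpoints on S3 (bucket/path are placeholders)
    env.setStateBackend(new FsStateBackend("s3://my-bucket/flink/checkpoints"));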
When I check the output on the TaskManager I shut down, the last logged id is, for example, 500000; the output of the resumed job on the other TaskManager then starts at around 400000. That is roughly 100K duplicated events, and the number can be higher or lower depending on the speed of the test.
Not sure if I'm missing something, but I would expect the job to log the next consecutive number (like 500001) after resuming on the different TaskManager.
Does anyone know why this is happening, or which extra settings I have to configure to obtain exactly-once behaviour?

asked Apr 16 '17 by razvan

People also ask

Is Flink exactly once?

Apache Flink is used for performing stateful computations on streaming data because of its low latency, reliability and exactly-once characteristics.

How does Flink achieve exactly once?

On the producer side, Flink uses a two-phase commit protocol to achieve exactly-once. Roughly, the Flink producer relies on Kafka transactions to write data, and only formally commits the data after the transaction is committed. Users can enable this functionality with Semantic.EXACTLY_ONCE.
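As a hedged illustration (the class and constructor names below come from the universal Kafka connector in later Flink releases, not from the Flink 1.2 setup in the question), enabling the transactional producer looks roughly like this:

    import java.nio.charset.StandardCharsets;
    import java.util.Properties;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
    import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
    import org.apache.kafka.clients.producer.ProducerRecord;

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "kafka:9092");   // placeholder broker
    // Must not exceed the broker's transaction.max.timeout.ms
    props.setProperty("transaction.timeout.ms", "900000");

    // Serialize each event as a UTF-8 encoded record for the output topic
    KafkaSerializationSchema<String> schema = new KafkaSerializationSchema<String>() {
        @Override
        public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
            return new ProducerRecord<>("output-topic", element.getBytes(StandardCharsets.UTF_8));
        }
    };

    FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
            "output-topic",                            // placeholder topic
            schema,
            props,
            FlinkKafkaProducer.Semantic.EXACTLY_ONCE); // commit the Kafka transaction when the checkpoint completes

With this semantic, records only become visible to downstream read_committed consumers once the corresponding checkpoint has completed.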

How does uber use Flink?

Flink uses a KafkaConsumer with "read_committed" mode enabled, so that it only reads messages from committed transactions. This feature was enabled at Uber as a direct result of the work discussed in this blog. Secondly, we generate unique identifiers for every record produced by the Aggregation job, which will be detailed below.
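On the consuming side, restricting a Flink Kafka source to committed records is (roughly, and again using the universal connector rather than the 1.2-era one) a matter of setting the isolation level in the consumer properties:

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "kafka:9092");   // placeholder broker
    props.setProperty("group.id", "event-processor");       // placeholder consumer group
    // Skip records from transactions that have not been committed yet
    props.setProperty("isolation.level", "read_committed");

    FlinkKafkaConsumer<String> consumer =
            new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props);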

Is Flink distributed?

Apache Flink is a distributed system and requires compute resources in order to execute applications. Flink integrates with all common cluster resource managers such as Hadoop YARN and Kubernetes, but can also be setup to run as a stand-alone cluster.


1 Answer

You are seeing the expected behavior for exactly-once. Flink implements fault-tolerance via a combination of checkpointing and replay in the case of failures. The guarantee is not that each event will be sent into the pipeline exactly once, but rather that each event will affect your pipeline's state exactly once.

Checkpointing creates a consistent snapshot across the entire cluster. During recovery, operator state is restored and the sources are replayed from the most recent checkpoint.
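To make the distinction concrete, here is a minimal sketch (not taken from the job in the question) of a keyed counter whose count lives in checkpointed state. After a failure the count is restored from the last completed checkpoint and each replayed event increments it exactly once, even though the log output produced for those replayed events is emitted a second time:

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    // Must run on a keyed stream, e.g. events.keyBy(...).flatMap(new CountingMapper())
    public class CountingMapper extends RichFlatMapFunction<Long, String> {

        private transient ValueState<Long> count;   // checkpointed keyed state

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("event-count", Long.class));
        }

        @Override
        public void flatMap(Long eventId, Collector<String> out) throws Exception {
            Long current = count.value();
            long updated = (current == null ? 0L : current) + 1;
            count.update(updated);   // restored on recovery, so events are never double-counted
            out.collect("seen " + updated + " events, last id " + eventId);
        }
    }

The side effects you observe (log lines, or writes to a non-transactional sink) are not covered by this guarantee, which is why you see the ids repeat after a TaskManager failure.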

For a more thorough explanation, see this data Artisans blog post: High-throughput, low-latency, and exactly-once stream processing with Apache Flink™, or the Flink docs.

answered Oct 05 '22 by David Anderson