Flink window state size and state management

After reading Flink's documentation and searching around, I couldn't entirely understand how Flink handles state in its windows. Let's say I have an hourly tumbling window with an aggregation function that accumulates messages into some Java POJO or Scala case class. Will the size of that window's state be tied to the number of events entering the window in a single hour, or will it just be tied to the POJO/case class, since I'm accumulating the events into that object? (E.g., if counting 10000 msgs into an integer, will the size be close to 10000 * msg size or the size of an int?) Also, if I'm using POJOs or case classes, does Flink handle the state for me (spilling to disk if memory is exhausted, saving state at checkpoints, etc.), or must I use Flink's state objects for that?

Thanks for your help!

asked Mar 19 '19 18:03

yaarix

1 Answer

The state size of a window depends on the type of function that you apply. If you apply a ReduceFunction or AggregateFunction, arriving data is immediately aggregated and the window only holds the aggregated value. In the question's example, counting 10000 msgs into an integer this way keeps state about the size of an int, not 10000 msgs. If you apply a ProcessWindowFunction or WindowFunction, Flink collects all input records and applies the function when time (event or processing time, depending on the window type) passes the window's end time, so the state grows with the number of records in the window.
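
To make the difference concrete, here is a minimal sketch in plain Java with no Flink dependency. The `Aggregate` interface is a hypothetical stand-in that borrows the method shape of Flink's AggregateFunction (`createAccumulator`, `add`, `getResult`), and all class and method names are made up for illustration. It only models what each style has to keep per window, not how Flink stores it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class WindowStateSketch {

    // Hypothetical stand-in shaped like Flink's AggregateFunction<IN, ACC, OUT>,
    // declared locally so the sketch runs without Flink on the classpath.
    interface Aggregate<IN, ACC, OUT> {
        ACC createAccumulator();
        ACC add(IN value, ACC accumulator);
        OUT getResult(ACC accumulator);
    }

    // Counts messages; the accumulator is a single Long.
    static final class CountAggregate implements Aggregate<String, Long, Long> {
        public Long createAccumulator() { return 0L; }
        public Long add(String value, Long accumulator) { return accumulator + 1; }
        public Long getResult(Long accumulator) { return accumulator; }
    }

    // Incremental style: each record is folded into the accumulator and then
    // dropped, so the window state is one Long regardless of the input size.
    static long countIncrementally(List<String> records) {
        CountAggregate agg = new CountAggregate();
        Long accumulator = agg.createAccumulator();
        for (String record : records) {
            accumulator = agg.add(record, accumulator);
        }
        return agg.getResult(accumulator);
    }

    // Buffering style: every record is kept until the window ends, so the
    // window state grows linearly with the number of records.
    static List<String> bufferAll(List<String> records) {
        List<String> windowState = new ArrayList<>();
        for (String record : records) {
            windowState.add(record);
        }
        return windowState;
    }

    public static void main(String[] args) {
        List<String> msgs = Collections.nCopies(10000, "msg");
        System.out.println("incremental state holds one count: " + countIncrementally(msgs));
        System.out.println("buffered state holds records: " + bufferAll(msgs).size());
    }
}
```

Both styles can produce the same count at the end of the window; what differs is how much has to be held (and checkpointed) until then.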

You can also combine both types of functions, i.e., have an AggregateFunction followed by a ProcessWindowFunction. In that case, arriving records are immediately aggregated, and when the window is closed, the aggregation result is passed as a single value to the ProcessWindowFunction. This is useful because you have incremental aggregation (due to the ReduceFunction / AggregateFunction) but also access to window metadata such as the begin and end timestamps (due to the ProcessWindowFunction).
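
The combined pattern can be modelled roughly as follows, again in plain Java without Flink. `Window`, `WindowResult`, `add`, and `close` are hypothetical names chosen for this sketch: `add` plays the role of the incremental aggregation step, and `close` plays the role of the function that runs once at window end and sees the single aggregated value together with the window bounds.

```java
class CombinedWindowSketch {

    // Hypothetical result type: the aggregate plus the window metadata.
    static final class WindowResult {
        final long windowStart;
        final long windowEnd;
        final long sum;

        WindowResult(long windowStart, long windowEnd, long sum) {
            this.windowStart = windowStart;
            this.windowEnd = windowEnd;
            this.sum = sum;
        }
    }

    // Hypothetical model of one tumbling window.
    static final class Window {
        private final long start;
        private final long end;
        private long sum; // the only per-window state that is kept

        Window(long start, long end) {
            this.start = start;
            this.end = end;
        }

        // Incremental step: fold the record into the running sum.
        void add(long value) {
            sum += value;
        }

        // Window-end step: receives one aggregated value plus the bounds.
        WindowResult close() {
            return new WindowResult(start, end, sum);
        }
    }

    public static void main(String[] args) {
        Window hourly = new Window(0L, 3_600_000L);
        for (long value : new long[] {1, 2, 3}) {
            hourly.add(value);
        }
        WindowResult result = hourly.close();
        System.out.println("[" + result.windowStart + ", " + result.windowEnd + ") sum=" + result.sum);
    }
}
```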

How the state is managed depends on the chosen state backend. If you configure the FsStateBackend, all local state is kept on the heap of the TaskManager, and the JVM process is killed with an OutOfMemoryError if the state grows too large. If you configure the RocksDBStateBackend, state is spilled to disk. This comes with de/serialization costs for every state access but gives much more storage for state.

answered Sep 21 '22 16:09

Fabian Hueske

Tags: apache-flink, stream-processing