Everywhere in Flink docs I see that a state is individual to a map function and a worker. This seems to be powerful in a standalone approach, but what if Flink runs in a cluster ? Can Flink handle a global state where all workers could add data and query it ?
From Flink article on states :
For high throughput and low latency in this setting, network communications among tasks must be minimized. In Flink, network communication for stream processing only happens along the logical edges in the job’s operator graph (vertically), so that the stream data can be transferred from upstream to downstream operators.
However, there is no communication between the parallel instances of an operator (horizontally). To avoid such network communication, data locality is a key principle in Flink and strongly affects how state is stored and accessed.
State in FlinkState snapshots, i.e., checkpoints and savepoints, are stored in a remote durable storage, and are used to restore the local state in the case of job failures. The appropriate state backend for a production deployment depends on scalability, throughput, and latency requirements.
Using Keyed State. The keyed state interfaces provides access to different types of state that are all scoped to the key of the current input element. This means that this type of state can only be used on a KeyedStream , which can be created via stream.
A Flink program consists of multiple tasks (transformations/operators, data sources, and sinks). A task is split into several parallel instances for execution and each parallel instance processes a subset of the task's input data. The number of parallel instances of a task is called its parallelism.
According to the Apache Flink documentation, KeyBy transformation logically partitions a stream into disjoint partitions. All records with the same key are assigned to the same partition.
I think that Flink only supports state on operators and state on Keyed streams, if you need some kind of global state, you have to store and recover data into some kind of database/file system/shared memory and mix that data with your stream.
Anyways, in my experiece, with a good processing pipeline design and partitioning your data in the right way, in most cases you should be able to apply divide and conquer algorithms or MapReduce strategies to archive your needs
If you introduce in your system some kind of global state, that global state could be a great bottleneck. So try to avoid it at all cost.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With