Apache Spark advertises that its operators (nodes) are "stateless". This allows Spark's architecture to use simpler protocols for things like recovery, load balancing, and handling stragglers.
Apache Flink, on the other hand, describes its operators as "stateful", and claims that statefulness is necessary for applications like machine learning. Yet Spark programs are able to pass information forward and maintain application data in RDDs without maintaining "state".
What is happening here? Is Spark not a true stateless system? Or is Flink's assertion that statefulness is essential for machine learning and similar applications incorrect? Or is there some additional nuance here?
I don't feel like I truly grok the difference between "stateful" and "stateless" systems, and I would appreciate an explanation of the distinction.
Stateful services keep track of sessions or transactions and react differently to the same inputs based on that history. Stateless services rely on clients to maintain sessions and center around operations that manipulate resources rather than state.
A very crude example of a stateless application could be a calculator that always starts at zero, storing no calculations or data from before. Your computer terminal, by contrast, is more of a stateful application, since it can store your previously run commands and some data.
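The calculator contrast above can be sketched in a few lines of plain Python (the names here are illustrative, not from any particular library): a stateless function computes each result purely from the arguments of that one call, while a stateful object also remembers what happened before.

```python
# Stateless: the result depends only on the inputs of this single call.
def add(a, b):
    return a + b

# Stateful: the result also depends on the history of earlier calls.
class Calculator:
    def __init__(self):
        self.total = 0  # state carried between interactions

    def add(self, x):
        self.total += x
        return self.total

calc = Calculator()
calc.add(2)   # returns 2
calc.add(3)   # returns 5 -- same kind of input, different result, because of history
add(2, 3)     # returns 5 every time, no matter how often it is called
```

Calling `calc.add(3)` twice in a row would give 5 and then 8: identical inputs, different outputs, which is exactly what "reacting differently based on history" means.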
Stateful means the computer or program keeps track of the state of interaction, usually by setting values in a storage field designated for that purpose. Stateless means there is no record of previous interactions and each interaction request has to be handled based entirely on information that comes with it.
In a stateless design, the server is not required to retain information about the client's state between requests, which makes the architecture easy to scale. In a stateful design, the server is required to keep information about the current state and session, which makes the architecture harder to scale.
The property of state refers to being able to access, at the current point in time, data from a previous point in time.
What does this mean? Assume I want to count all the words that have arrived at my streaming application. But the nature of streaming is that data flows in and out of the pipeline. In order to be able to access previous data — in this example, some kind of map that holds the previous count for each word in the stream — I have to access state that was accumulated.
While some of Spark's RDD operators are stateless, such as map, filter, etc., it does expose stateful operators in the form of mapWithState. Not only that: in the new Spark streaming architecture, called "Structured Streaming", state is built into the pipeline and mostly abstracted away from the user in order to expose aggregation operators such as agg.
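mapWithState itself is a Scala/Java API, but the idea behind it can be illustrated language-agnostically. Below is a hedged plain-Python analogue (the names stateful_map and count_fn are hypothetical, not Spark API): the operator hands the user function each record together with the previously stored state for that record's key, and the function returns both an output and the new state.

```python
# A toy analogue of a keyed stateful operator in the spirit of mapWithState.
# For each (key, value) record, fn sees the previous state for that key
# and returns (output, new_state).

def stateful_map(records, fn):
    state = {}  # per-key state, retained between records
    out = []
    for key, value in records:
        result, state[key] = fn(key, value, state.get(key))
        out.append(result)
    return out

# The running word count from above, expressed as a stateful mapping:
def count_fn(word, one, prev):
    total = (prev or 0) + one
    return (word, total), total

stream = [("spark", 1), ("flink", 1), ("spark", 1)]
stateful_map(stream, count_fn)
# -> [('spark', 1), ('flink', 1), ('spark', 2)]
```

A stateless operator like map is the degenerate case where fn ignores prev and returns no new state — which is why map and filter need no recovery protocol for state, while mapWithState does.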