What is the difference between a "stateful" and "stateless" system?

Apache Spark brags that its operators (nodes) are "stateless". This allows Spark's architecture to use simpler protocols for things like recovery, load balancing, and handling stragglers.

On the other hand, Apache Flink describes its operators as "stateful", and claims that statefulness is necessary for applications like machine learning. Yet Spark programs are able to pass information forward and maintain application data in RDDs without maintaining "state".

What is happening here? Is Spark not a truly stateless system? Or is Flink's assertion that statefulness is essential for machine learning and similar applications incorrect? Or is there some additional nuance here?

I don't feel like I truly grok the difference between "stateful" and "stateless" systems, and I would appreciate it if the difference could be explained.

asked Mar 03 '18 22:03

Shuklaswag

1 Answer

State is the ability to access, at the current point in time, data that was produced at a previous point in time.

What does this mean? Assume I want to keep a count of all the words that have arrived at my streaming application. The nature of streaming is that data flows into and out of the pipeline, so once a record has been processed it is gone. To know the previous count for each word, I need something like a map from word to count that persists between records. That map is the state I have accumulated.
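To make the distinction concrete, here is a minimal sketch in plain Python (not Spark or Flink code). The names `to_lower` and `RunningWordCount` are made up for illustration. The first operator looks only at the batch in front of it; the second has to remember counts from every earlier batch.

```python
from collections import Counter

def to_lower(batch):
    """Stateless: the output depends only on the current batch."""
    return [word.lower() for word in batch]

class RunningWordCount:
    """Stateful: keeps counts accumulated from all earlier batches."""

    def __init__(self):
        self.counts = Counter()  # this map is the operator's state

    def process(self, batch):
        self.counts.update(batch)
        return dict(self.counts)

# Two micro-batches arriving one after the other.
batches = [["Spark", "Flink"], ["spark"]]
counter = RunningWordCount()
results = [counter.process(to_lower(b)) for b in batches]
```

Running `to_lower` on the second batch alone gives the same answer no matter what came before it, so any node could redo that work. The count for "spark" after the second batch is only correct if the first batch's count is still available.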

While some of Spark's operators are stateless, such as map and filter, Spark Streaming does expose stateful operators in the form of mapWithState. Not only that, in the newer Spark streaming architecture, called "Structured Streaming", state is built into the pipeline and mostly abstracted away from the user so that it can expose aggregation operators, such as agg.
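As a rough illustration of what a keyed stateful operator has to do, here is a plain-Python sketch. `KeyedStateStore`, `map_with_state`, and `count_update` are hypothetical names; this only mimics the general idea of a per-key update function and does not reproduce Spark's actual mapWithState signature. It also shows why state complicates recovery: the state has to be snapshotted and restored, whereas a stateless operator can simply be re-run.

```python
import copy

class KeyedStateStore:
    """Hypothetical per-key state store; not a Spark class."""

    def __init__(self):
        self._state = {}

    def map_with_state(self, records, update_fn):
        # For each (key, value), hand the previous state for that key
        # to update_fn and store whatever new state it returns.
        out = []
        for key, value in records:
            new_state, emitted = update_fn(key, value, self._state.get(key))
            self._state[key] = new_state
            out.append(emitted)
        return out

    def snapshot(self):
        # A checkpoint: a copy of the state that survives a failure.
        return copy.deepcopy(self._state)

    def restore(self, snap):
        self._state = copy.deepcopy(snap)

def count_update(key, value, state):
    """Running sum per key; state is None the first time a key is seen."""
    total = (state or 0) + value
    return total, (key, total)

store = KeyedStateStore()
first = store.map_with_state([("spark", 1), ("flink", 1)], count_update)
checkpoint = store.snapshot()
second = store.map_with_state([("spark", 1)], count_update)

# Simulate a failure after the checkpoint: a fresh store must restore
# the snapshot before replaying the lost batch.
recovered = KeyedStateStore()
recovered.restore(checkpoint)
replayed = recovered.map_with_state([("spark", 1)], count_update)
```

Without the `restore` call, the replayed batch would report a count of 1 for "spark" instead of 2, which is the kind of bookkeeping a stateless design avoids.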

answered Oct 04 '22 01:10

Yuval Itzchakov

Tags: state, apache-spark, streaming, apache-flink, spark-streaming
