I have been going through the documentation of Spark 2.3.1 on Structured Streaming, but could not find details of how stateful operations work internally with the state store. More specifically, what I would like to know is: (1) is the state store distributed? (2) if so, how: per worker or per core?
It seems that in previous versions of Spark it was per worker, but I have no idea whether that still holds. I know that it is backed by HDFS, but nothing explains how the in-memory store actually works.
Is it indeed a distributed in-memory store? I am particularly interested in de-duplication: if data are streamed from, say, a large data set, this needs to be planned for, since the whole "distinct" data set will ultimately be held in memory by the end of processing that data set. Hence one needs to size the workers or the master depending on how that state store works.
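To make the scenario concrete, here is a minimal sketch of the kind of query I have in mind (the Kafka source, column names, and paths are just placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("dedup-sketch").getOrCreate()

// Hypothetical source and column names, just for illustration.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS userId", "timestamp AS eventTime")

// Without a watermark, dropDuplicates has to keep every key it has ever
// seen in the state store; with one, state older than the watermark can
// be evicted, which bounds the memory needed.
val deduped = events
  .withWatermark("eventTime", "1 hour")
  .dropDuplicates("userId", "eventTime")

deduped.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/dedup-ckpt") // placeholder path
  .start()
  .awaitTermination()
```

My concern is what happens to the memory behind that dropDuplicates state when the set of distinct keys is very large.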
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
Spark Streaming Sources: every input DStream (except file streams) is associated with a Receiver object, which receives the data from a source and stores it in Spark's memory for processing.
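For example, the classic socket word count creates a receiver-based DStream (a minimal sketch; host and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: at least two threads, so the receiver does not starve processing.
val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-sketch")
val ssc = new StreamingContext(conf, Seconds(10))

// socketTextStream creates a receiver that runs on one executor and
// stores incoming lines in Spark's memory before they are processed.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```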
The purpose of the state store is to provide a reliable place from which the engine can read the intermediary results of Structured Streaming aggregations. Thanks to it, Spark can recover the processing state to the point before the failure, even in the case of a driver failure.
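For example, here is a sketch of wiring a stateful query to a checkpoint location (the rate source is just a stand-in, and the HDFS path is a placeholder). Killing the driver and restarting the same code with the same path resumes from the last committed state version:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("state-recovery-sketch").getOrCreate()

// A tiny stateful aggregation over the built-in rate source.
val counts = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()
  .groupBy(col("value") % 10)
  .count()

// The checkpoint location is where the state store persists its files;
// restarting the query with the same path recovers the aggregation state.
counts.writeStream
  .outputMode("update")
  .format("console")
  .option("checkpointLocation", "hdfs:///checkpoints/my-query") // placeholder path
  .start()
  .awaitTermination()
```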
Both the Spark Streaming and Structured Streaming models use micro- (or mini-) batching as their primary processing mechanism; it is the details that differ. Spark Streaming uses DStreams, while Structured Streaming uses DataFrames to process the streams of data pouring into the analytics engine.
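The same word count as in the DStream sketch above, expressed against a streaming DataFrame (again, host and port are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("sstreaming-sketch").getOrCreate()

// The stream is just an unbounded DataFrame; the same groupBy/count
// API works on it as on a static one.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

val counts = lines
  .select(explode(split(col("value"), " ")).as("word"))
  .groupBy("word")
  .count()

counts.writeStream.outputMode("complete").format("console").start()
  .awaitTermination()
```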
There is only one implementation of the state store in Structured Streaming, and it is backed by an in-memory HashMap plus HDFS: the in-memory HashMap is for data storage, while HDFS is for fault tolerance. The HashMap occupies executor memory on the worker, and each HashMap holds the versioned key-value data of one aggregated partition (generated after a stateful operator such as deduplication, groupBy, etc.).
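Conceptually (this is an illustrative toy model, not Spark's actual classes, and the names are made up), each partition's store behaves roughly like a map from micro-batch version to a key-value map, with each committed version also written out as a delta file:

```scala
// Illustrative model only -- not Spark's actual implementation. It shows
// the shape of the data: one store per aggregated partition, versioned
// by micro-batch, with deltas persisted to the checkpoint on HDFS.
import scala.collection.mutable

class ToyStateStore(partitionId: Int) {
  // version (micro-batch id) -> key -> aggregated value
  private val versions = mutable.Map[Long, mutable.Map[String, Long]]()

  def commit(version: Long, updates: Map[String, Long]): Unit = {
    // Start from the previous version's map and apply this batch's updates.
    val base = versions.get(version - 1).map(_.clone())
      .getOrElse(mutable.Map.empty[String, Long])
    base ++= updates
    versions(version) = base
    // The real store would also write a <version>.delta file to the
    // checkpoint directory here, and periodically compact into a snapshot.
  }

  def get(version: Long): Option[collection.Map[String, Long]] = versions.get(version)
}
```

If I remember the 2.3.1 code correctly, those files land under the query's checkpoint directory per operator and partition as versioned .delta files with periodic .snapshot compactions, and the provider is selected by spark.sql.streaming.stateStore.providerClass, which defaults to HDFSBackedStateStoreProvider.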
But this does not explain how the HDFSBackedStateStore actually works. I don't see it in the documentation.
You are correct that there is no such documentation available. I had to read the code (2.3.1) and wrote an article on how the state store works internally in Structured Streaming. You might like to have a look: https://www.linkedin.com/pulse/state-management-spark-structured-streaming-chandan-prakash/