Spark 2.3.1 Structured Streaming state store inner working

I have been going through the documentation of Spark 2.3.1 on Structured Streaming, but could not find details of how stateful operations work internally with the state store. More specifically, what I would like to know is: (1) is the state store distributed? (2) if so, how: per worker or per core?

It seems like in previous versions of Spark it was per worker, but I have no idea about the current version. I know that it is backed by HDFS, but nothing explains how the in-memory store actually works.

Indeed, is it a distributed in-memory store? I am particularly interested in de-duplication: if data are streamed from, let's say, a large data set, then this needs to be planned for, as the whole "distinct" data set will ultimately be held in memory by the end of processing that data set. Hence one needs to size the workers (or master) according to how that state store works.
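For illustration, here is a minimal sketch of how streaming de-duplication is typically expressed (the rate source and the column names eventId/eventTime are placeholders standing in for a real data set). Without a watermark, Spark must keep every key it has ever seen in the state store indefinitely, which is exactly the unbounded-memory concern above; with a watermark, state older than the threshold can be evicted:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

// Placeholder source: a rate stream standing in for the real data set.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()
  .withColumnRenamed("value", "eventId")
  .withColumnRenamed("timestamp", "eventTime")

// With a watermark, dedup state for keys older than 10 minutes can be
// dropped; without it, the "distinct" set grows without bound.
val deduped = events
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("eventId", "eventTime")

deduped.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/dedup-ckpt") // placeholder path
  .start()
  .awaitTermination()
```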

asked Aug 17 '18 by MaatDeamon

People also ask

How does Spark Streaming work internally?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
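As a concrete illustration of that micro-batch model, here is the classic word count from the Spark Streaming programming guide (host and port are placeholders); each 5-second slice of input becomes one batch, represented as an RDD inside the DStream:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-sketch")
// Each 5-second window of input becomes one micro-batch (an RDD).
val ssc = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```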

Where does Spark Streaming store the data?

Spark Streaming Sources: every input DStream (except a file stream) is associated with a Receiver object, which receives the data from a source and stores it in Spark's memory for processing.
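A toy custom Receiver makes the "stores it in Spark's memory" part visible: records are handed to Spark via store(), using a storage level of the receiver's choosing. This is a minimal sketch of the real Receiver API, emitting a constant record rather than reading a real source:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A toy receiver that pushes a fixed record into Spark's memory each
// second; store() hands the record to Spark, which keeps it at the
// requested storage level until the micro-batch consumes it.
class ConstantReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  override def onStart(): Unit = {
    new Thread("constant-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("tick")
          Thread.sleep(1000)
        }
      }
    }.start()
  }
  override def onStop(): Unit = ()
}
```

It would be wired in with ssc.receiverStream(new ConstantReceiver) on a StreamingContext like the one above.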

Why does state information need to be stored in Spark Streaming?

The purpose of the state store is to provide a reliable place from where the engine can read the intermediary result of Structured Streaming aggregations. Thanks to this place Spark can, even in the case of driver failure, recover the processing state to the point before the failure.
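In practice, that reliable place is configured through the checkpointLocation option of a streaming query; the state store's delta and snapshot files live under that directory on an HDFS-compatible file system. A sketch (the rate source, query, and path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("checkpoint-sketch").getOrCreate()

val counts = spark.readStream
  .format("rate").load()
  .groupBy("value")
  .count()

// State store files are written under this checkpoint directory, so a
// restarted driver can reload the last committed version and resume.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "hdfs:///tmp/ckpt/rate-counts") // placeholder
  .start()

query.awaitTermination()
```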

What is the difference between Spark Streaming and structured Streaming?

Both the Spark Streaming and Structured Streaming models use micro- (or mini-) batching as their primary processing mechanism; it is the detail that changes. Spark Streaming uses DStreams, while Structured Streaming uses DataFrames to process these streams of data pouring into the analytics engine.
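For contrast with the DStream word count above, the same computation in Structured Streaming operates on an unbounded DataFrame rather than a sequence of RDDs (host and port are again placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("structured-sketch").getOrCreate()
import spark.implicits._

// The stream is an unbounded DataFrame, not a sequence of RDDs.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost").option("port", 9999) // placeholder source
  .load()

val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```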


1 Answer

There is only one implementation of the State Store in Structured Streaming, and it is backed by an in-memory HashMap and HDFS. While the in-memory HashMap is for data storage, HDFS is for fault tolerance. The HashMap occupies executor memory on the worker, and each HashMap represents the versioned key-value data of an aggregated partition (generated after a stateful operator like deduplication, groupBy, etc.).
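To make the "versioned key-value data per partition" idea concrete, here is a conceptual sketch (not the actual Spark code): each micro-batch loads the map at the previous committed version, applies its updates, and commits a new version, with the deltas persisted to HDFS so a lost executor can rebuild the map:

```scala
import scala.collection.mutable

// A toy model of a versioned key-value store, illustrating the idea
// behind HDFSBackedStateStoreProvider; the real implementation keeps
// one such map per aggregation partition and writes delta files to
// the checkpoint directory on HDFS.
class ToyStateStore {
  private val versions = mutable.Map[Long, Map[String, Long]](0L -> Map.empty)

  // Load the map for a committed version (Spark would replay HDFS
  // deltas here if the executor no longer holds it in memory).
  def load(version: Long): Map[String, Long] = versions(version)

  // Apply a batch of updates and commit them as the next version.
  def commit(version: Long, updates: Map[String, Long]): Long = {
    versions(version + 1) = load(version) ++ updates
    // In Spark, `updates` would also be flushed as a delta file under
    // the query's checkpoint directory for fault tolerance.
    version + 1
  }
}
```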

But this does not explain how the HDFSBackedStateStore actually works. I don't see it in the documentation.

You are correct that there is no such documentation available. I had to understand the code (2.3.1) and wrote an article on how the State Store works internally in Structured Streaming. You might like to have a look: https://www.linkedin.com/pulse/state-management-spark-structured-streaming-chandan-prakash/

answered Oct 12 '22 by chandan prakash