I have been trying to find materials online - both are micro-batch based - so what's the difference?
Internally, a DStream is a sequence of RDDs: Spark receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing.
Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis. This processed data can be pushed out to file systems, databases, and live dashboards.
Discretized Streams (DStreams)
A Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data: either the input data stream received from a source, or the processed data stream generated by transforming the input stream.
Kafka analyses events as they unfold, so it employs a continuous (event-at-a-time) processing model. Spark, on the other hand, uses a micro-batch approach, which divides incoming streams into small batches for processing.
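As a rough sketch of that micro-batch model (the app name and two-thread local master are illustrative assumptions, not anything from the question), the batch interval passed to a StreamingContext is the knob that decides how the incoming stream is chopped into batches:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // All data received during each 1-second interval is grouped
    // into one micro-batch and handed to the execution engine.
    val conf = new SparkConf().setAppName("MicroBatchDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))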
A brief description of Spark Streaming (RDD/DStream) and Spark Structured Streaming (Dataset/DataFrame):
Spark Streaming is based on DStreams. A DStream is represented by a continuous series of RDDs, which is Spark's abstraction of an immutable, distributed dataset. Spark Streaming has the following problems:
Difficult - it was not simple to build streaming pipelines supporting delivery policies: exactly-once guarantees, handling of late-arriving data, or fault tolerance. All of these were implementable, but they needed extra work from the programmer.
Inconsistent - the API used for batch processing (RDD, Dataset) was different from the API for stream processing (DStream). Nothing blocking, of course, but it is always simpler (especially in maintenance cost) to deal with as few abstractions as possible.
See the example below:
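A minimal sketch of a DStream word-count pipeline, assuming a plain-text source on a local socket (localhost and port 9999 are placeholder values):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Each 1-second micro-batch of lines becomes one RDD inside the DStream.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()          // push each batch's result to the console
    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // block until the job is stopped

Note that the transformations (flatMap, map, reduceByKey) mirror the batch RDD API, yet live on the separate DStream abstraction - the inconsistency mentioned above.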
Spark Streaming flow diagram:
Spark Structured Streaming can be understood as an unbounded table, growing with new incoming data, i.e. it can be thought of as stream processing built on Spark SQL.
More concretely, Structured Streaming brought some new concepts to Spark:
exactly-once guarantee - Structured Streaming focuses on this concept. It means that data is processed only once and the output does not contain duplicates.
event time - one of the observed problems with DStream streaming was processing order, i.e. the case when data generated earlier was processed after data generated later. Structured Streaming handles this problem with a concept called event time that, under some conditions, allows late data to be correctly aggregated in processing pipelines.
sink, Result Table, output mode, and watermark are other features of Spark Structured Streaming.
See the example below:
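A minimal Structured Streaming sketch touching event time, watermark, output mode, and sink, again assuming a local socket source (the host, port, and the window/watermark durations are placeholder values):

    import java.sql.Timestamp
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder
      .appName("StructuredWordCount")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Socket source; includeTimestamp attaches an arrival timestamp to each line.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .option("includeTimestamp", true)
      .load()
      .as[(String, Timestamp)]

    // Event-time windowed count with a watermark: rows arriving more than
    // 10 minutes behind the latest seen event time are dropped instead of
    // re-opening old windows.
    val counts = lines
      .flatMap { case (line, ts) => line.split(" ").map(word => (word, ts)) }
      .toDF("word", "timestamp")
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"), $"word")
      .count()

    // The console sink prints the Result Table; the "update" output mode
    // emits only the rows that changed in each trigger.
    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .start()

    query.awaitTermination()

Here readStream/writeStream reuse the same Dataset/DataFrame operations as batch Spark SQL - the unified API that DStreams lacked.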
Spark Structured Streaming flow diagram: