We have already discussed the questions below: <ul> <li>What is the difference between Apache Spark and Apache Flink? [closed]</li> <li>What does “streaming” mean in Apache Spark and Apache Flink?</li> <li>What is the difference between mini-batch vs real time streaming in practice (not theory)?</li> </ul> But <code>Spark Structured Streaming</code> was added in Spark 2.2; it brings a lot of changes to streaming, and it is outstanding. Can we say that <code>Spark Structured Streaming</code> is stream processing, or is it still batch processing? And what is the big difference now between <code>Apache Flink</code> and <code>Apache Spark Structured Streaming</code>?

Apache Spark Structured Streaming vs Apache Flink: what is the difference?

1 Answer

Currently:

Spark Structured Streaming still uses micro-batches in the background. However, it supports event-time processing and quite low latency (though not as low as Flink's), and it supports SQL and type-safe queries on streams in one API: there is no distinction, and every Dataset can be queried either with SQL or with type-safe operators. It has end-to-end exactly-once semantics (at least they say so ;) ). The throughput is better than Flink's (there were some benchmarks with different results, but look at the Databricks post about the results).
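To make "event-time processing" a bit more concrete, here is a minimal plain-Python sketch. It is not Spark or Flink API; the function name `tumbling_window_counts` and the sample records are made up for illustration. The idea it shows is that a record is assigned to a window by the timestamp it carries, not by the moment it arrives, so a late record still lands in the right window.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using the event's own timestamp.

    `events` is an iterable of (event_time_seconds, key) pairs given in
    arrival order, which may differ from event-time order.
    """
    counts = defaultdict(int)
    for event_time, key in events:
        # The window is derived from the timestamp inside the record,
        # not from the position of the record in the arrival sequence.
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical sample: the record with event time 3 arrives after the one at 12.
arrivals = [(1, "a"), (12, "a"), (3, "a"), (14, "b")]
result = tumbling_window_counts(arrivals, window_seconds=10)
print(result)
```

A real engine also has to decide how long to wait for late records (watermarks) and when to emit results; this sketch leaves all of that out.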

In near future:

Spark Continuous Processing Mode is in progress, and it will give Spark ~1 ms latency, comparable to Flink's. However, as I said, it's still in progress. The API is ready for non-batch jobs, so this is easier to do than in the previous Spark Streaming.

The main difference:

Spark relies on micro-batching for now, while Flink has pre-scheduled operators. That means Flink's latency is lower, but the Spark community is working on Continuous Processing Mode, which will work similarly (as far as I understand) to receivers.
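A rough way to see why micro-batching adds latency is a toy model like the one below. This is plain Python, not a benchmark and not either engine's API: the function names, the 500 ms batch interval, and the 1 ms per-record cost are all assumptions picked for illustration. It only models the waiting time until the next batch boundary and ignores scheduling and processing time.

```python
def micro_batch_latencies(arrival_times_ms, batch_interval_ms):
    """Waiting time per record if work only starts at the next batch boundary."""
    latencies = []
    for t in arrival_times_ms:
        # Integer ceiling: the first batch boundary at or after the arrival time.
        boundary = ((t + batch_interval_ms - 1) // batch_interval_ms) * batch_interval_ms
        latencies.append(boundary - t)
    return latencies


def per_record_latencies(arrival_times_ms, per_record_cost_ms):
    """Waiting time per record if an always-running operator handles it on arrival."""
    return [per_record_cost_ms for _ in arrival_times_ms]


# Hypothetical arrival times in milliseconds.
arrivals_ms = [5, 120, 480, 999]
micro = micro_batch_latencies(arrivals_ms, batch_interval_ms=500)
continuous = per_record_latencies(arrivals_ms, per_record_cost_ms=1)
print(micro, continuous)
```

In this model a record waits anywhere from zero up to almost a full batch interval, while the per-record path pays a small fixed cost. Real numbers depend on the job, the cluster, and the trigger settings.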

answered Oct 22 '22 13:10

T. Gawęda

Tags:

apache-spark

apache-flink

spark-structured-streaming

ShuMing Li
