Apache Spark Structured Streaming vs Apache Flink: what is the difference?

We have discussed the questions below:

  • What is the difference between Apache Spark and Apache Flink? [closed]
  • What does “streaming” mean in Apache Spark and Apache Flink?
  • What is the difference between mini-batch vs real time streaming in practice (not theory)?

But Structured Streaming, which was introduced in Spark 2.0 and declared production-ready in Spark 2.2, brings a lot of changes to streaming, and it is outstanding.

Can we say Spark Structured Streaming is stream processing, or is it still batch processing?

And what is the main difference between Apache Flink and Apache Spark Structured Streaming now?

ShuMing Li asked Sep 01 '17 07:09


1 Answer

Currently:

Spark Structured Streaming still uses micro-batches under the hood. However, it supports event-time processing and quite low latency (though not as low as Flink's), and it offers SQL and type-safe queries on streams in a single API: there is no distinction, and every Dataset can be queried with either SQL or type-safe operators. It has end-to-end exactly-once semantics (at least that's what they say ;) ). Throughput is reported to be better than Flink's (benchmarks have produced differing results, but see the Databricks post about theirs).
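To make "event-time processing" concrete, here is a minimal pure-Python sketch (not Spark or Flink code). It assumes each record carries its own timestamp and counts records per tumbling window using that timestamp rather than arrival order. The function name, the sample events, and the window size are all made up for illustration, and watermarks (how long to wait for late data) are left out.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Count events per tumbling window, keyed by each event's own timestamp."""
    counts = defaultdict(int)
    for event_time_ms, _payload in events:
        # Assign the event to the window that contains its event time,
        # regardless of when it actually arrived.
        window_start = (event_time_ms // window_ms) * window_ms
        counts[window_start] += 1
    return dict(sorted(counts.items()))

# (event_time_ms, payload) listed in ARRIVAL order; "c" arrives late
# but still belongs to the window starting at 1000 ms.
events = [(1000, "a"), (2500, "b"), (1200, "c"), (3100, "d")]
counts = tumbling_window_counts(events, window_ms=1000)
print(counts)  # {1000: 2, 2000: 1, 3000: 1}
```

In Structured Streaming the same idea is expressed by grouping on `window()` over an event-time column, usually combined with `withWatermark` to bound how late data may be.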

In near future:

Spark's Continuous Processing mode is in progress and is expected to give Spark ~1 ms latency, comparable to Flink's. However, as I said, it's still in progress. The API is already designed for non-batch jobs, so this is easier to do than it was in the previous Spark Streaming.

The main difference:

Spark currently relies on micro-batching, while Flink has pre-scheduled operators. That means Flink's latency is lower, but the Spark community is working on Continuous Processing mode, which will work similarly (as far as I understand) to receivers.
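A rough way to see why micro-batching adds latency is a toy simulation in plain Python (not the Spark or Flink API). It assumes a micro-batch engine only picks up an event at the next trigger tick, while a long-running operator handles each event as it arrives. The arrival times, the 500 ms trigger interval, and the 1 ms per-record cost are all invented numbers, and processing and scheduling overhead are ignored.

```python
def micro_batch_latencies(arrivals_ms, trigger_interval_ms):
    """Wait time per event when events are only picked up at trigger ticks."""
    latencies = []
    for t in arrivals_ms:
        # Integer ceiling: first trigger tick at or after the arrival time.
        ticks = (t + trigger_interval_ms - 1) // trigger_interval_ms
        latencies.append(ticks * trigger_interval_ms - t)
    return latencies

def per_record_latencies(arrivals_ms, per_record_cost_ms=1):
    """Wait time per event when an always-running operator handles it on arrival."""
    return [per_record_cost_ms for _ in arrivals_ms]

arrivals = [5, 120, 480, 510, 990]  # hypothetical arrival times in ms
batch = micro_batch_latencies(arrivals, trigger_interval_ms=500)
record = per_record_latencies(arrivals)
print(batch)   # [495, 380, 20, 490, 10]
print(record)  # [1, 1, 1, 1, 1]
```

In this model the worst-case wait in micro-batch mode approaches the trigger interval, so shrinking the interval lowers latency only until per-batch overhead dominates. That overhead is what a continuous, record-at-a-time execution mode avoids.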

T. Gawęda answered Oct 22 '22 13:10
