
The benefits of Flink Kafka Stream over Spark Kafka Stream? And Kafka Stream over Flink? [closed]

In Spark Streaming, we set a batch interval for near-real-time micro-batch processing. In Flink (DataStream) or Storm, the stream is processed in real time, so I guess there is no such concept as a batch interval.

In Kafka, the consumer pulls messages. I imagine that Spark uses the batch interval to decide when to pull messages from the Kafka broker, so how do Flink and Storm do it? I imagine that Flink and Storm pull Kafka messages in a tight loop to form the real-time stream source. If so, and if I set the Spark batch interval to something small such as 100 ms, 50 ms, or even less, is there still a significant difference between Spark Streaming and Flink or Storm?

Meanwhile, in Spark, if the volume of streaming data is large and the batch interval is too small, a lot of data may pile up waiting to be processed, so there is a chance of running into an OutOfMemoryError. Could the same thing happen in Flink or Storm?

I have implemented an application that does topic-to-topic transformation. The transformation is simple, but the source data could be huge (think of it as an IoT app). My original implementation is backed by reactive-kafka, and it works fine in my standalone Scala/Akka app. I did not implement the application to be clustered, because if I need that, Flink/Storm/Spark are already there. Then I found Kafka Streams, which to me looks similar to reactive-kafka from the point of view of client usage. So, if I use Kafka Streams or reactive-kafka in standalone applications or microservices, do I have to worry about the reliability/availability of the client code?

Stephen Kuo asked Oct 24 '16 03:10




1 Answer

Your understanding of micro-batch vs. stream processing is correct. You are also right that all three systems use the standard Java consumer provided by Kafka to pull data for processing in an infinite loop.

The main difference is that Spark needs to schedule a new job for each micro-batch it processes. This scheduling overhead is quite high, so Spark cannot handle very low batch intervals like 100 ms or 50 ms efficiently, and throughput goes down for such small batches.
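As a back-of-the-envelope sketch of why small intervals hurt, assume a fixed cost to schedule each micro-batch job. The 40 ms figure and the function name below are made up for illustration; they are not measured Spark numbers.

```python
def useful_fraction(batch_interval_ms: float, scheduling_overhead_ms: float) -> float:
    """Share of each batch interval that is left for actual record processing."""
    if batch_interval_ms <= 0:
        raise ValueError("batch interval must be positive")
    remaining = batch_interval_ms - scheduling_overhead_ms
    # If the overhead alone exceeds the interval, nothing is left for processing.
    return max(remaining, 0.0) / batch_interval_ms


# Hypothetical fixed cost to schedule one micro-batch job (an assumption).
OVERHEAD_MS = 40.0

for interval in (2000, 500, 100, 50):
    share = useful_fraction(interval, OVERHEAD_MS)
    print(f"interval={interval:>5} ms  time left for processing={share:.0%}")
```

Under this simple model the fixed cost is negligible for a 2 s interval but takes most of a 50 ms interval, which is the effect described above.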

Flink and Storm are both true streaming systems: both deploy the job only once at startup (and the job runs continuously until explicitly shut down by the user), so they can handle each individual input record without per-batch scheduling overhead and with very low latency.
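To make the latency difference concrete, here is a toy model (my own simplification, ignoring processing and scheduling time) of how long a record waits just because of batching: in a micro-batch system it sits until the end of the interval it arrived in, while a continuously running operator can pick it up right away. The arrival times are invented.

```python
def micro_batch_wait_ms(arrivals_ms, batch_interval_ms):
    """Waiting time per record until the batch interval it arrived in closes."""
    waits = []
    for t in arrivals_ms:
        # End of the interval [k * interval, (k + 1) * interval) containing t.
        batch_end = (t // batch_interval_ms + 1) * batch_interval_ms
        waits.append(batch_end - t)
    return waits


# Invented arrival timestamps in milliseconds.
arrivals = [10, 120, 250, 990]

waits = micro_batch_wait_ms(arrivals, batch_interval_ms=500)
print("wait per record with 500 ms batches:", waits)
print("average wait:", sum(waits) / len(waits), "ms")
# A record-at-a-time system has no such batching wait; its latency is
# dominated by the per-record processing cost instead.
```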

Furthermore, for Flink, JVM heap memory is not a limitation, because Flink can use off-heap memory and can also write to disk if the available main memory is too small. (Btw: since Project Tungsten, Spark can also use off-heap memory and can spill to disk to some extent, but it works differently than in Flink, AFAIK.) Storm, AFAIK, does neither and is limited to JVM memory.
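The general idea of spilling can be sketched with a toy FIFO buffer that keeps a bounded number of records in memory and writes the overflow to a temporary file. This is only an illustration of the concept; it is not how Flink or Spark manage memory internally, and the class name and limit are hypothetical.

```python
import os
import pickle
import tempfile
from collections import deque


class SpillingBuffer:
    """Toy FIFO buffer: at most `max_in_memory` records in RAM, the rest on disk."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self._memory = deque()
        self._spill = tempfile.TemporaryFile()  # binary temp file, deleted on close
        self._read_pos = 0   # file offset of the next spilled record to read
        self._spilled = 0    # number of spilled records not yet read back

    def put(self, record):
        # Once something has been spilled, keep spilling so FIFO order is preserved.
        if self._spilled == 0 and len(self._memory) < self.max_in_memory:
            self._memory.append(record)
        else:
            self._spill.seek(0, os.SEEK_END)
            pickle.dump(record, self._spill)
            self._spilled += 1

    def get(self):
        if self._memory:
            return self._memory.popleft()
        if self._spilled:
            self._spill.seek(self._read_pos)
            record = pickle.load(self._spill)
            self._read_pos = self._spill.tell()
            self._spilled -= 1
            return record
        raise IndexError("buffer is empty")


buf = SpillingBuffer(max_in_memory=2)
for i in range(5):
    buf.put(i)           # records 2, 3 and 4 go to disk
print([buf.get() for _ in range(5)])
```

The point is only that a backlog larger than the in-memory limit ends up on disk instead of triggering an out-of-memory error; a real engine also has to bound disk usage and apply backpressure.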

I am not familiar with reactive Kafka.

Kafka Streams is a fully fault-tolerant, stateful stream processing library. It is designed for microservice development: you do not need a dedicated processing cluster as for Flink/Storm/Spark, and you can deploy your application instances anywhere and in any way you want. You scale your application by simply starting up more instances. Check out the documentation for more details: http://docs.confluent.io/current/streams/index.html (there are also interesting posts about Kafka Streams on the Confluent blog: http://www.confluent.io/blog/)
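Roughly, "starting up more instances" works because the input topic partitions get divided among the running instances. The sketch below uses a plain round-robin split to show the effect; it is a simplification, not the actual Kafka Streams assignment logic, and the instance names are made up.

```python
def assign_partitions(num_partitions, instances):
    """Spread partition ids over instances round-robin (simplified illustration)."""
    if not instances:
        raise ValueError("need at least one instance")
    assignment = {name: [] for name in instances}
    for partition in range(num_partitions):
        owner = instances[partition % len(instances)]
        assignment[owner].append(partition)
    return assignment


# One instance processes all six partitions of a hypothetical input topic.
print(assign_partitions(6, ["app-1"]))

# Starting two more instances splits the same work three ways.
print(assign_partitions(6, ["app-1", "app-2", "app-3"]))
```

Note that in this model an instance beyond the number of partitions would get nothing to do, so the partition count of the input topics limits how far you can scale out.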

Matthias J. Sax answered Nov 13 '22 12:11