The benefits of Flink Kafka Stream over Spark Kafka Stream? And Kafka Stream over Flink? [closed]

In Spark Streaming, we set a batch interval for near-real-time micro-batch processing. In Flink (DataStream) or Storm, the stream is processed in real time, so I guess there is no such concept as a batch interval.

In Kafka, the consumer pulls messages. I imagine that Spark uses the batch interval parameter to pull messages from the Kafka broker, so how do Flink and Storm do it? I imagine that Flink and Storm pull Kafka messages in a fast loop to form the real-time stream source. If so, and if I set the Spark batch interval to something small such as 100ms, 50ms or even smaller, is there any significant difference between Spark Streaming and Flink or Storm?

Meanwhile, in Spark, if the streaming data is large and the batch interval is too small, we may end up with lots of data waiting to be processed, and therefore there is a chance we will see an OutOfMemory error. Would that happen in Flink or Storm?

I have implemented an application that does topic-to-topic transformation. The transformation is simple, but the source data could be huge (consider it an IoT app). My original implementation is backed by reactive-kafka, and it works fine in my standalone Scala/Akka app. I did not implement the application to be clustered, because if I need that, Flink/Storm/Spark are already there. Then I found Kafka Streams; to me it is similar to reactive-kafka from the point of view of client usage. So, if I use Kafka Streams or reactive-kafka in standalone applications or microservices, do we have to worry about the reliability/availability of the client code?

asked Oct 24 '16 03:10

Stephen Kuo

1 Answer

Your understanding of micro-batch vs. stream processing is correct. You are also right that all three systems use the standard Java consumer provided by Kafka to pull data for processing in an infinite loop.

The main difference is that Spark needs to schedule a new job for each micro-batch it processes. This scheduling overhead is quite high, so Spark cannot handle very low batch intervals like 100ms or 50ms efficiently, and throughput goes down for such small batches.
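To make the overhead argument concrete, here is a rough back-of-the-envelope model in Python. It assumes a fixed scheduling cost per micro-batch, which is a simplification; the 20 ms figure is a made-up illustration, not a measured Spark number, and the function name is hypothetical.

```python
def useful_fraction(batch_interval_ms: float, overhead_ms: float) -> float:
    """Share of each batch interval left for processing records.

    Assumes a fixed scheduling overhead per micro-batch (a simplification).
    """
    if batch_interval_ms <= 0:
        raise ValueError("batch_interval_ms must be positive")
    return max(0.0, (batch_interval_ms - overhead_ms) / batch_interval_ms)


if __name__ == "__main__":
    ASSUMED_OVERHEAD_MS = 20.0  # hypothetical value, for illustration only
    for interval in (1000, 100, 50, 10):
        share = useful_fraction(interval, ASSUMED_OVERHEAD_MS)
        print(f"interval={interval}ms -> {share:.0%} of time left for processing")
```

Under this assumption, a 1 s interval loses about 2% of its time to scheduling, while a 50 ms interval loses 40%, and an interval shorter than the overhead leaves nothing for actual work.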

Flink and Storm are both true streaming systems: both deploy the job only once at startup (the job then runs continuously until the user explicitly shuts it down), so they can process each individual input record without that overhead and with very low latency.
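The latency side of this can be sketched with a toy model, again in Python. It assumes that a record arriving during a micro-batch interval is only picked up when that interval closes, and it ignores processing time entirely; both function names are hypothetical and none of this is engine code.

```python
def micro_batch_latencies(arrivals_ms, batch_interval_ms):
    """Wait time of each record until its micro-batch closes.

    A record arriving at time t is picked up at the end of the interval
    that contains t; processing time itself is ignored.
    """
    latencies = []
    for t in arrivals_ms:
        batch_end = (t // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(batch_end - t)
    return latencies


def per_record_latencies(arrivals_ms):
    """In a record-at-a-time model each record is handed on immediately."""
    return [0 for _ in arrivals_ms]


if __name__ == "__main__":
    arrivals = [0, 30, 99, 100, 250]
    print(micro_batch_latencies(arrivals, 100))  # waiting time per record
    print(per_record_latencies(arrivals))
```

In this model the waiting time in the micro-batch case is bounded by the batch interval, which is why shrinking the interval is the only way to lower latency there, and why the scheduling overhead then starts to matter.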

Furthermore, for Flink, JVM main memory is not a limitation, because Flink can use off-heap memory as well as write to disk if the available main memory is too small. (Btw: Spark, since Project Tungsten, can also use off-heap memory and can spill to disk to some extent, but differently from Flink, AFAIK.) Storm, AFAIK, does neither and is limited to JVM memory.
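The general idea of spilling to disk can be illustrated with a small Python sketch. This is not how Flink (or Spark) manages memory internally; it only shows the principle of keeping a bounded number of items in memory and writing the overflow to a file. The class name `SpillingBuffer` and its parameter are hypothetical.

```python
import json
import tempfile


class SpillingBuffer:
    """FIFO buffer that keeps at most `max_in_memory` items in memory
    and appends the overflow to a temporary file as JSON lines."""

    def __init__(self, max_in_memory: int):
        self.max_in_memory = max_in_memory
        self._memory = []
        self._spill_file = None  # created lazily on first overflow
        self.spilled_count = 0

    def append(self, item) -> None:
        if len(self._memory) < self.max_in_memory:
            self._memory.append(item)
            return
        if self._spill_file is None:
            self._spill_file = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        self._spill_file.seek(0, 2)  # always write at the end of the file
        self._spill_file.write(json.dumps(item) + "\n")
        self.spilled_count += 1

    def __iter__(self):
        # In-memory items are the oldest, so yield them first to keep FIFO order.
        yield from self._memory
        if self._spill_file is not None:
            self._spill_file.flush()
            self._spill_file.seek(0)
            for line in self._spill_file:
                yield json.loads(line)

    def close(self) -> None:
        if self._spill_file is not None:
            self._spill_file.close()
            self._spill_file = None


if __name__ == "__main__":
    buf = SpillingBuffer(max_in_memory=2)
    for i in range(5):
        buf.append({"id": i})
    print(list(buf), "spilled:", buf.spilled_count)
    buf.close()
```

The trade-off is the usual one: the buffer no longer fails when memory runs short, but reading spilled items back is slower than reading from memory.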

I am not familiar with reactive Kafka.

Kafka Streams is a fully fault-tolerant, stateful stream processing library. It is designed for microservice development: you do not need a dedicated processing cluster as for Flink/Storm/Spark, but can deploy your application instances anywhere and in any way you want. You scale your application by simply starting up more instances. Check out the documentation for more details: http://docs.confluent.io/current/streams/index.html (there are also interesting posts about Kafka Streams on the Confluent blog: http://www.confluent.io/blog/)
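Scaling by starting more instances works because the input topic's partitions are divided among the running instances. The Python sketch below uses plain round-robin, which is an assumption made for illustration: the real assignment is done through Kafka's consumer group rebalancing and may distribute partitions differently. The function and instance names are hypothetical.

```python
def assign_partitions(num_partitions: int, instances: list) -> dict:
    """Distribute partition ids over instances round-robin (a simplification)."""
    if not instances:
        raise ValueError("need at least one instance")
    assignment = {name: [] for name in instances}
    for partition in range(num_partitions):
        owner = instances[partition % len(instances)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    print(assign_partitions(6, ["app-1", "app-2"]))
    print(assign_partitions(6, ["app-1", "app-2", "app-3"]))
    # More instances than partitions: the extra instance gets nothing to do.
    print(assign_partitions(2, ["app-1", "app-2", "app-3"]))
```

The last case shows the practical limit: the number of input partitions caps the useful parallelism, so for a large IoT-style source the topic's partition count should be chosen with the expected number of instances in mind.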

answered Nov 13 '22 12:11

Matthias J. Sax

Tags:

apache-kafka

apache-storm

apache-flink

apache-kafka-streams

spark-streaming
