Lately I've been looking into real-time data processing with Storm, Flink, etc. All of the architectures I have come across use Kafka as a layer between the data sources and the stream processor. Why should this layer exist?
Hotels.com uses Kafka as a pipeline to collect real-time events from multiple sources and to send data to HDFS. Kafka is also used as a distributed, high-speed message queue in our help desk software, as well as for real-time event data aggregation and analytics.
Why would you use Kafka? Kafka is used to build real-time streaming data pipelines and real-time streaming applications. A data pipeline reliably processes and moves data from one system to another, and a streaming application is an application that consumes streams of data.
Kafka is often used to build real-time data streams and streaming applications. By combining messaging, storage, and stream processing, it enables the collection and analysis of both real-time and historical data. It is written in Scala and Java and is frequently used for big data analytics and real-time event stream processing.
Apache Kafka has the following benefits over traditional messaging techniques. Fast: a single Kafka broker can serve thousands of clients, handling megabytes of reads and writes per second. Scalable: data are partitioned and spread over a cluster of machines, allowing the system to handle larger volumes of data.
Apache Kafka solves this slow, multi-step process by acting as an intermediary: it receives data from source systems and then makes that data available to target systems in real time. What’s more, your systems won’t crash under the load, because Apache Kafka runs on its own separate set of servers (called an Apache Kafka cluster).
In Kafka, a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
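For example, here is a minimal sketch of such a stream processor using the Kafka Streams API. The topic names "input-events" and "output-events", the application id, and the broker address are placeholders chosen for illustration, not anything prescribed by Kafka:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-processor");  // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-events");  // continual stream from the input topic
        input.mapValues(value -> value.toUpperCase())                    // some processing on each record
             .to("output-events");                                       // continual stream to the output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```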
If you learn one thing from the examples in this blog post, remember that Kafka is not just a messaging system! While data ingestion into a Hadoop data lake was the first prominent use case, that now represents less than 5% of actual Kafka deployments.
Now that Apache Kafka has reached a stable 1.0 version, more companies are adopting the technology as the backbone of their IT infrastructure. Increasingly, CTOs are prioritizing real-time architectures and reducing the wait time on data availability.
I think there are three main reasons to use Apache Kafka for real-time processing:
In real-time processing, there is a requirement for fast and reliable delivery of data from the data sources to the stream processor. If you do not do this well, it can easily become the bottleneck of your real-time processing system. This is where Kafka can help.
Traditional messaging systems such as ActiveMQ and RabbitMQ were not particularly good at handling huge amounts of data in real time. For that reason, LinkedIn engineers developed their own messaging system, Apache Kafka, to cope with this issue.
Distribution: Kafka is natively distributed, which fits the distributed nature of stream processing. Kafka divides incoming data into partitions, ordered by offset, that are physically distributed over the cluster. These partitions can then feed the stream processor in a distributed manner.
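As a quick illustration of partitioning, here is a sketch of a producer sending keyed records: records with the same key always land in the same partition, so different partitions can be consumed in parallel by different stream-processor instances. The topic name "events", the keys, and the broker address are made-up placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class PartitionedProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                String key = "sensor-" + (i % 3);  // a few keys, spread over the topic's partitions
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("events", key, "reading-" + i);
                RecordMetadata meta = producer.send(record).get();  // metadata reports partition and offset
                System.out.printf("key=%s -> partition=%d offset=%d%n",
                        key, meta.partition(), meta.offset());
            }
        }
    }
}
```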
Performance: Kafka was designed to be simple, sacrificing advanced features for the sake of performance. Kafka outperforms traditional messaging systems by a wide margin, which can also be seen in this paper. The main reasons are listed below:
The Kafka producer can skip waiting for acknowledgments from the broker and send data as fast as the broker can handle (see the producer configuration sketch after this list).
Kafka has a more efficient storage format with less meta-data.
The Kafka broker is stateless with respect to consumers: it does not need to keep track of each consumer's state.
Kafka exploits the UNIX sendfile API to efficiently deliver data from a broker to a consumer by reducing the number of data copies and system calls.
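To illustrate the first point, here is a sketch of a producer tuned for throughput rather than delivery guarantees: acks=0 means the producer does not wait for broker acknowledgments. The batching and compression values shown are illustrative assumptions, not recommended settings, and the topic name and broker address are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FireAndForgetProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(ProducerConfig.ACKS_CONFIG, "0");                 // do not wait for acknowledgments
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");           // wait up to 20 ms to fill batches
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536");       // 64 KB batches
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");   // compress batches on the wire
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("events", "msg-" + i));  // fire and forget
            }
        }
    }
}
```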
Reliability: Kafka serves as a buffer between the data sources and the stream processor so that it can absorb large loads of data. Kafka simply stores all the incoming data, and consumers are responsible for deciding how much data they want to process and how fast. This ensures reliable load handling, so the stream processor will not be overwhelmed by too much data.
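The pull-based consumer is what makes this buffering work: the stream processor asks for data at its own pace instead of having data pushed at it. A minimal sketch, assuming a topic "events", a group id "stream-processor", and a local broker (all placeholders); max.poll.records caps how many records are taken per poll:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PacedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "stream-processor");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");  // take at most 100 records per poll
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);       // however slow this is, Kafka just keeps the backlog
                }
                consumer.commitSync();     // record our progress (offsets)
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}
```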
Kafka's retention policy also makes it easy to recover from failures during processing (Kafka retains all data for 7 days by default). Each consumer keeps track of the offset of its last processed message. For this reason, if a consumer fails, it is easy to roll back to the point right before the failure and start processing again without losing information or needing to reprocess the whole stream from the beginning.
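A sketch of that recovery path: because the broker retains the log, a restarted consumer can simply seek back to the last offset it knows was fully processed and replay from there. The topic, partition number, group id, and the offset value below are all hypothetical placeholders (in practice the offset would come from your own checkpoint or the committed offsets):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "stream-processor");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");  // we manage offsets ourselves
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        long lastGoodOffset = 42_000L;                         // hypothetical checkpointed offset
        TopicPartition partition = new TopicPartition("events", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, lastGoodOffset);          // rewind to just after the last processed message
            consumer.poll(Duration.ofMillis(500))
                    .forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        }
    }
}
```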