Lately I've been looking into real-time data processing with Storm, Flink, etc. All of the architectures I have come across use Kafka as a layer between the data sources and the stream processor. Why should this layer exist?
Hotels.com uses Kafka as a pipeline to collect real-time events from multiple sources and to send data to HDFS. Kafka is also used as a distributed, high-speed message queue in our help desk software, as well as for real-time event data aggregation and analytics.
Why would you use Kafka? Kafka is used to build real-time streaming data pipelines and real-time streaming applications. A data pipeline reliably processes and moves data from one system to another, and a streaming application is an application that consumes streams of data.
Kafka is often used to build real-time data streams and streaming applications. By combining messaging, storage, and stream processing, it enables the collection and analysis of both real-time and historical data. It is written in Scala and Java and is frequently used for big data analytics and real-time event stream processing.
Apache Kafka has the following benefits over traditional messaging techniques. Fast: a single Kafka broker can serve thousands of clients, handling megabytes of reads and writes per second. Scalable: data are partitioned and spread over a cluster of machines, allowing the system to handle larger volumes of data.
Apache Kafka solves this slow, multi-step process by acting as an intermediary: it receives data from source systems and then makes that data available to target systems in real time. What’s more, your systems won’t crash under the load, because Apache Kafka runs on its own separate set of servers (called an Apache Kafka cluster).
In Kafka, a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
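For example, here is a minimal sketch of such a stream processor using the Kafka Streams API. The topic names "input-events" and "output-events", the application id, and the broker address are placeholders chosen for illustration, not anything prescribed by Kafka:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-processor");  // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-events");  // continual stream from the input topic
        input.mapValues(value -> value.toUpperCase())                    // some processing on each record
             .to("output-events");                                       // continual stream to the output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```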
If you learn one thing from the examples in this blog post, remember that Kafka is not just a messaging system! While data ingestion into a Hadoop data lake was the first prominent use case, that now represents less than 5% of actual Kafka deployments.
Now that Apache Kafka has reached a stable 1.0 version, more companies are adopting the technology as the backbone of their IT infrastructure. Increasingly, CTOs are prioritizing real-time architectures and reducing the wait time on data availability.
I think there are three main reasons to use Apache Kafka for real-time processing:
In real-time processing, there is a requirement for fast and reliable delivery of data from the data sources to the stream processor. If you do not do this well, it can easily become the bottleneck of your real-time processing system. This is where Kafka can help.
Traditional messaging systems such as ActiveMQ and RabbitMQ were not particularly good at handling huge amounts of data in real time. For that reason, LinkedIn engineers developed their own messaging system, Apache Kafka, to cope with this issue.
Distribution: Kafka is natively distributed, which fits the distributed nature of stream processing. Kafka divides incoming data into partitions, ordered by offset, that are physically distributed over the cluster. These partitions can then feed the stream processor in a distributed manner.
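As a quick illustration of partitioning, here is a sketch of a producer sending keyed records: records with the same key always land in the same partition, so different partitions can be consumed in parallel by different stream-processor instances. The topic name "events", the keys, and the broker address are made-up placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class PartitionedProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                String key = "sensor-" + (i % 3);  // a few keys, spread over the topic's partitions
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("events", key, "reading-" + i);
                RecordMetadata meta = producer.send(record).get();  // metadata reports partition and offset
                System.out.printf("key=%s -> partition=%d offset=%d%n",
                        key, meta.partition(), meta.offset());
            }
        }
    }
}
```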
Performance: Kafka was designed to be simple, sacrificing advanced features for the sake of performance. Kafka outperforms traditional messaging systems by a wide margin, which can also be seen in this paper. The main reasons are listed below:
The Kafka producer can skip waiting for acknowledgments from the broker and send data as fast as the broker can handle (see the producer configuration sketch after this list).
Kafka has a more efficient storage format with less meta-data.
The Kafka broker is stateless with respect to consumers: it does not need to keep track of each consumer's state.
Kafka exploits the UNIX sendfile API to efficiently deliver data from a broker to a consumer by reducing the number of data copies and system calls.
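To illustrate the first point, here is a sketch of a producer tuned for throughput rather than delivery guarantees: acks=0 means the producer does not wait for broker acknowledgments. The batching and compression values shown are illustrative assumptions, not recommended settings, and the topic name and broker address are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FireAndForgetProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(ProducerConfig.ACKS_CONFIG, "0");                 // do not wait for acknowledgments
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");           // wait up to 20 ms to fill batches
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536");       // 64 KB batches
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");   // compress batches on the wire
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("events", "msg-" + i));  // fire and forget
            }
        }
    }
}
```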
Reliability: Kafka serves as a buffer between the data sources and the stream processor so that it can absorb large loads of data. Kafka simply stores all the incoming data, and consumers are responsible for deciding how much data they want to process and how fast. This ensures reliable load handling, so the stream processor will not be overwhelmed by too much data.
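The pull-based consumer is what makes this buffering work: the stream processor asks for data at its own pace instead of having data pushed at it. A minimal sketch, assuming a topic "events", a group id "stream-processor", and a local broker (all placeholders); max.poll.records caps how many records are taken per poll:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PacedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "stream-processor");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");  // take at most 100 records per poll
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);       // however slow this is, Kafka just keeps the backlog
                }
                consumer.commitSync();     // record our progress (offsets)
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}
```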
Kafka's retention policy also makes it easy to recover from failures during processing (Kafka retains all data for 7 days by default). Each consumer keeps track of the offset of its last processed message. For this reason, if a consumer fails, it is easy to roll back to the point right before the failure and start processing again without losing information or needing to reprocess the whole stream from the beginning.
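A sketch of that recovery path: because the broker retains the log, a restarted consumer can simply seek back to the last offset it knows was fully processed and replay from there. The topic, partition number, group id, and the offset value below are all hypothetical placeholders (in practice the offset would come from your own checkpoint or the committed offsets):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "stream-processor");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");  // we manage offsets ourselves
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        long lastGoodOffset = 42_000L;                         // hypothetical checkpointed offset
        TopicPartition partition = new TopicPartition("events", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, lastGoodOffset);          // rewind to just after the last processed message
            consumer.poll(Duration.ofMillis(500))
                    .forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        }
    }
}
```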