 

Why use Apache Kafka in real-time processing?

Lately I've been looking into real-time data processing using Storm, Flink, etc. All the architectures I've come across use Kafka as a layer between the data sources and the stream processor. Why should this layer exist?

asked Apr 13 '17 by Exorcismus

People also ask

Where is Kafka used in real-time processing?

Hotels.com uses Kafka as a pipeline to collect real-time events from multiple sources and to send data to HDFS. Kafka is used as a distributed, high-speed message queue in our help desk software, as well as for our real-time event data aggregation and analytics.

Why do we use Apache Kafka?

Kafka is used to build real-time streaming data pipelines and real-time streaming applications. A data pipeline reliably processes and moves data from one system to another, and a streaming application is an application that consumes streams of data.

Is Kafka used for data processing?

Kafka is often used to build real-time data streams and applications. Combining messaging, storage, and stream processing enables the collection and analysis of both real-time and historical data. Kafka is written in Scala and Java and is frequently used for big data analytics and real-time event stream processing.

What are the benefits of Apache Kafka over traditional messaging techniques?

Apache Kafka has the following benefits over traditional messaging techniques: Fast: a single Kafka broker can serve thousands of clients, handling megabytes of reads and writes per second. Scalable: data is partitioned and spread over a cluster of machines to handle larger volumes.

How does Apache Kafka work?

Apache Kafka solves this slow, multi-step process by acting as an intermediary: it receives data from source systems and then makes that data available to target systems in real time. What's more, source and target systems stay decoupled and can't bring each other down, because Apache Kafka runs as its own separate set of servers (called an Apache Kafka cluster).
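
For illustration, here is a minimal sketch of that intermediary role using Kafka's Java producer client. The broker address, topic name, and record contents are placeholder assumptions, not details from the source:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class EventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Address of the Kafka cluster (placeholder).
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The source system only writes to a topic; target systems
                // read from it independently, so the two sides never couple.
                producer.send(new ProducerRecord<>("events", "user-42", "page_view"));
            }
        }
    }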

What is a stream processor in Kafka?

In Kafka, a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
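
A sketch of such a processor using the Kafka Streams Java API; the application id, topic names, and the trivial uppercase transformation are illustrative assumptions:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class UppercaseProcessor {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Continually read from the input topic, transform each value,
            // and write the results to the output topic.
            KStream<String, String> input = builder.stream("input-topic");
            input.mapValues(value -> value.toUpperCase()).to("output-topic");

            new KafkaStreams(builder.build(), props).start();
        }
    }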

Is Kafka just a messaging system?

If you learn one thing from the examples in this blog post, remember that Kafka is not just a messaging system! While data ingestion into a Hadoop data lake was the first prominent use case, it accounts for less than 5% of actual Kafka deployments today.

Why are CTOs increasingly adopting Apache Kafka?

Now that Apache Kafka has reached a stable 1.0 version, more companies are adopting the technology as the backbone of their IT infrastructure. Increasingly, CTOs are prioritizing enabling more real-time architecture and reducing the wait time on data availability.


1 Answer

I think there are three main reasons to use Apache Kafka for real-time processing:

  • Distribution
  • Performance
  • Reliability

In real-time processing, there is a requirement for fast and reliable delivery of data from the data sources to the stream processor. If you don't do this well, it can easily become the bottleneck of your real-time processing system. This is where Kafka can help.

Traditional messaging systems such as Apache ActiveMQ and RabbitMQ were not particularly good at handling huge amounts of data in real time. That is why LinkedIn engineers developed their own messaging system, Apache Kafka, to cope with this issue.

Distribution: Kafka is natively distributed, which fits the distributed nature of stream processing. Kafka divides incoming data into partitions, each an ordered sequence indexed by offsets, which are physically distributed over the cluster. These partitions can then feed the stream processor in a distributed manner.
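
As a concrete sketch of that layout, the Java AdminClient can create a topic whose partitions are spread (and replicated) across the brokers. The topic name and the partition/replica counts here are arbitrary examples, and a replication factor of 3 assumes at least three brokers:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreatePartitionedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // 6 partitions spread the topic's data across the brokers;
                // replication factor 3 keeps a copy on three of them.
                NewTopic topic = new NewTopic("clicks", 6, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }

Consumers in the same consumer group then split these partitions among themselves, which is exactly how a distributed stream processor parallelizes its input.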

Performance: Kafka was designed to be simple, sacrificing advanced features for the sake of performance. Kafka outperforms traditional messaging systems by a wide margin, as can also be seen in this paper. The main reasons are listed below:

  • The Kafka producer does not have to wait for acknowledgments from the broker and can send data as fast as the broker can handle it (see the config sketch after this list).

  • Kafka has a more efficient storage format with less metadata.

  • The Kafka broker is stateless: it does not need to track the state of its consumers.

  • Kafka exploits the UNIX sendfile API to deliver data from a broker to a consumer efficiently, reducing the number of data copies and system calls.
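
The fire-and-forget behavior from the first bullet is configurable in today's Java client through the acks setting (newer client versions actually default to safer settings); a minimal sketch with a placeholder broker, topic, and payload:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class FireAndForgetProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            // acks=0: the producer never waits for a broker acknowledgment,
            // trading delivery guarantees for maximum throughput.
            props.put("acks", "0");
            // acks=all would instead wait for all in-sync replicas.

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 1000; i++) {
                    producer.send(new ProducerRecord<>("metrics", Integer.toString(i)));
                }
            }
        }
    }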

Reliability: Kafka serves as a buffer between the data sources and the stream processor, absorbing large bursts of data. Kafka simply stores all incoming data, and consumers decide how much, and how fast, they want to process it. This provides reliable load-balancing: the stream processor is never overwhelmed by too much data.

Kafka's retention policy also makes it easy to recover from failures during processing (Kafka retains all data for 7 days by default). Each consumer keeps track of the offset of its last processed message. So if a consumer fails, it is easy to roll back to the point right before the failure and resume processing without losing information or reprocessing the whole stream from the beginning, as the sketch below illustrates.
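
A consumer sketch tying the last two paragraphs together: the consumer pulls at its own pace (so it is never pushed more than it can handle) and commits offsets only after processing, so a restart resumes from the last committed offset. The group id, topic, and process stub are illustrative assumptions:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AtLeastOnceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "stream-processor");
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            // Commit offsets manually, only after records are fully processed.
            props.put("enable.auto.commit", "false");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events"));
                while (true) {
                    // The consumer pulls as much as it wants, when it wants;
                    // unread data simply waits in the log (7 days by default).
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record.value());
                    }
                    // If we crash before this commit, the group restarts from the
                    // last committed offset and replays only the unfinished batch.
                    consumer.commitSync();
                }
            }
        }

        static void process(String value) { /* application logic (stub) */ }
    }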

answered Oct 04 '22 by Stefan Repcek