 

What makes Kafka high in throughput?

Tags:

apache-kafka

Most articles describe Kafka as having better read/write throughput than other message brokers (MBs) like ActiveMQ. My understanding is that reading and writing with the help of offsets makes it faster, but I am not clear on how offsets make it faster.

After reading about Kafka's architecture, I have some understanding, but it is still not clear to me what makes Kafka scalable and high in throughput, based on the points below:

  1. Probably, with the offset, the client knows exactly which message it needs to read, which may be one factor in its high performance.

    In the case of other MBs, the broker needs to coordinate among consumers so that each message is delivered to only one consumer. But that is the case for queues only, not for topics. So what makes a Kafka topic faster than another MB's topic?

  2. Kafka provides partitioning for scalability, but other MBs like ActiveMQ also provide clustering. So how is Kafka better for big data/high loads?

  3. In other MBs we can have listeners, so as soon as a message arrives the broker delivers it. In Kafka we need to poll, which seems to mean more load on both the broker and the client side?
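For context, here is roughly what the poll/offset model from points 1 and 3 looks like in client code; a minimal sketch, assuming a local broker and a hypothetical topic/group name:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetPollExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "example-group");           // hypothetical group id
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // hypothetical topic
            while (true) {
                // The client pulls a batch of records; each record carries the offset
                // it was read from, so the consumer always knows its exact position.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```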

asked Jun 19 '17 by user3198603

2 Answers

Lots of detail on what makes Kafka different from, and faster than, other messaging systems is in Jay Kreps' blog post here:

https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

There are actually a lot of differences that make Kafka perform well, including but not limited to:

  • Maximized use of sequential disk reads and writes
  • Zero-copy processing of messages
  • Use of Linux OS page cache rather than Java heap for caching
  • Partitioning of topics across multiple brokers in a cluster
  • Smart client libraries that offload certain functions from the brokers
  • Batching of multiple published messages to yield less frequent network round trips to the broker (see the producer sketch after this list)
  • Support for multiple in-flight messages
  • Prefetching data into client buffers for faster subsequent requests.
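To make the batching and in-flight points concrete, here is a minimal producer sketch; the broker address and topic name are placeholders, and the values are illustrative rather than recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Batching: collect up to 64 KB or wait up to 20 ms before sending,
        // so many records share one network round trip to the broker.
        props.put("batch.size", 65536);
        props.put("linger.ms", 20);
        // Allow several batches to be in flight to the broker at once.
        props.put("max.in.flight.requests.per.connection", 5);
        // Compress whole batches for better use of network and disk.
        props.put("compression.type", "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100_000; i++) {
                producer.send(new ProducerRecord<>("my-topic", // hypothetical topic
                        Integer.toString(i), "value-" + i));
            }
        }
    }
}
```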
answered Dec 09 '22 by Hans Jespersen

It's largely marketing that Kafka is fast for a message broker. For example, IBM MessageSight appliances did 13M msgs/sec with microsecond latency in 2013. On one machine. A year before Kreps even started the GitHub repo: https://www.zdnet.com/article/ibm-launches-messagesight-appliance-aimed-at-m2m/

Kafka is good for a lot of things. True low-latency messaging is not one of them. You simply can't use batch delivery (e.g. a range of offsets) in any purely latency-centric environment. When an event arrives, delivery must be attempted immediately if you want the lowest latency; that doesn't mean waiting a couple of seconds to batch-read a block of events, or paying the overhead of requesting every message. Try using Kafka with an offset range of 1 (i.e. one message at a time) if you want to compare it to a normal push-based broker, and you'll see what I mean.
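If you want to try that comparison yourself, a consumer can be configured to fetch as little as possible per request. This is only a sketch: the property names are real Kafka consumer settings, but the broker address, group id, and values are illustrative, and this is not a production configuration.

```java
import java.util.Properties;

public class SingleRecordFetchConfig {
    // Consumer settings that approximate one-message-at-a-time delivery,
    // useful only for comparing latency against a push-based broker.
    public static Properties lowLatencyConsumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "latency-test");            // hypothetical group id
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("max.poll.records", 1);  // hand back at most one record per poll()
        props.put("fetch.min.bytes", 1);   // don't wait for a batch to build up on the broker
        props.put("fetch.max.wait.ms", 0); // return fetches immediately, even if small
        return props;
    }
}
```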

Instead, I recommend focusing on the thing pull-based stream buffering does give you:

  • Replayability!!!

Personally, I think this makes downstream data engineering systems a bit easier to build in the face of failure, particularly since you don't have to rely on their built-in replication models (if they even have one). For example, it's very easy for me to consume messages, lose the disks, restore the machine, and replay the lost data. The data streams become the single source of truth against which other systems can synchronize and this is exceptionally useful!!!
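As a rough illustration of that replay step (topic name, partition, and group id are hypothetical), a consumer can simply be rewound and the lost records re-read:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "rebuild-after-disk-loss"); // hypothetical group id
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition partition = new TopicPartition("my-topic", 0); // hypothetical topic
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(partition));
            // Rewind to the earliest retained offset (or to a known checkpoint)
            // and replay everything the downstream system lost.
            consumer.seekToBeginning(Collections.singletonList(partition));
            for (ConsumerRecord<String, String> record :
                    consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("replaying offset=%d value=%s%n",
                        record.offset(), record.value());
            }
        }
    }
}
```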

There's no free lunch in messaging, pull and push each have their advantages and disadvantages vs. each other. It might not surprise you that people have also tried push-pull messaging and it's no free lunch either :).

answered Dec 09 '22 by Rob Bird