I'm a newbie for Kafka. When I read the documentation of Kafka, I saw that Kafka is performing well because of sequential disk access.
But how is that possible? In Java (or any other language), if I use file I/O, the OS handles it for me. I can't know whether the OS stores the files I write in contiguous sectors or scatters them across multiple sectors. So, in my opinion, Kafka cannot always claim that its disk access is sequential.
Am I right or not?
The most important thing to remember is that Kafka preserves the order of messages within a partition. For some use cases, preserving message ordering can be very important from a business point of view.
Kafka's delivery guarantees can be divided into three groups: "at most once", "at least once", and "exactly once".
- "At most once": messages can be lost, but they are never redelivered or duplicated.
- "At least once": messages are never lost, but they may be duplicated.
- "Exactly once": each message is delivered once and only once.
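As an illustration, here is a minimal sketch of producer configurations that roughly correspond to each guarantee. It assumes the standard Java client and a broker at localhost:9092; the class and method names are mine, not part of any Kafka API.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DeliveryConfigs {

    // "At most once": fire and forget; no acks, no retries, so a failed send is lost.
    static Properties atMostOnce() {
        Properties p = baseProps();
        p.put(ProducerConfig.ACKS_CONFIG, "0");
        p.put(ProducerConfig.RETRIES_CONFIG, "0");
        return p;
    }

    // "At least once": wait for all in-sync replicas and retry on failure;
    // a retry after a lost ack can write the same message twice.
    static Properties atLeastOnce() {
        Properties p = baseProps();
        p.put(ProducerConfig.ACKS_CONFIG, "all");
        p.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        return p;
    }

    // "Exactly once" (within Kafka): the idempotent producer lets the broker
    // deduplicate retries; combine with transactions for cross-partition atomicity.
    static Properties exactlyOnce() {
        Properties p = baseProps();
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        p.put(ProducerConfig.ACKS_CONFIG, "all");
        return p;
    }

    static Properties baseProps() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer");
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer");
        return p;
    }
}
```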
In Kafka, order can only be guaranteed within a partition. This means that if a producer sends messages in a specific order, the broker will write them to the partition in that order, and all consumers will read them from that partition in the same order.
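A small sketch of how you would rely on this in practice: records that share a key are hashed to the same partition by the default partitioner, so their relative order is preserved. The topic name, key, and broker address here are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // All records share the key "order-42", so they land in the same
            // partition and consumers will see event-0, event-1, event-2 in order.
            for (int i = 0; i < 3; i++) {
                producer.send(new ProducerRecord<>("orders", "order-42", "event-" + i));
            }
        } // close() flushes any pending records
    }
}
```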
When it comes to exactly-once semantics between Kafka and external systems, the limitation is not necessarily in the messaging system itself, but in the need to coordinate the consumer's position with what has already been stored in the destination. For example, the destination might be an HDFS- or object-store-based data lake.
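One common way to achieve this coordination is to store the consumer's offset in the destination system atomically with the data itself, so that on restart the consumer resumes from exactly the position the destination has seen. Below is a rough sketch of that pattern; the `Database` class is hypothetical and stands in for any transactional store, and the topic, group, and broker address are assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetInDestination {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "lake-writer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // we track offsets ourselves
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition tp = new TopicPartition("orders", 0);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            // Resume from the offset recorded in the destination, not in Kafka.
            consumer.seek(tp, Database.loadOffset(tp));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // One transaction writes both the record and the next offset,
                    // so a crash can never leave them out of sync.
                    Database.writeAtomically(r.value(), tp, r.offset() + 1);
                }
            }
        }
    }

    // Hypothetical transactional store standing in for the data lake's metadata.
    static class Database {
        static long loadOffset(TopicPartition tp) { return 0L; }
        static void writeAtomically(String row, TopicPartition tp, long nextOffset) {
            /* write row and nextOffset in a single transaction */
        }
    }
}
```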
Kafka does not always access the disk sequentially, but it does several things that make sequential access much more likely.

All Kafka messages are stored in large segment files (1GB each by default). Because Kafka messages are not deleted when consumed (unlike in other message brokers), Kafka does not fragment the filesystem over time by continuously creating and deleting many variable-length files. Instead it creates a segment file and appends to it until it reaches 1GB (a configurable limit). Only when all messages in a segment have expired does Kafka delete the entire 1GB segment. This means these 1GB sections of disk are often laid out as contiguous blocks. It is a recommended best practice to keep the Kafka commit log files on a dedicated filesystem, so they do not get fragmented by other applications reading and writing variable-length files into the same filesystem.

More importantly, most reading and writing of these segment files is sequential and goes through the OS page cache, which reduces disk I/O even further by keeping the most frequently accessed pages in memory. This is also why it is recommended to tune the kernel's swappiness down to 1, to reduce the likelihood that these cached pages get swapped out of memory.
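For illustration, here is a small sketch that creates a topic with an explicit segment size using the Java AdminClient. The topic name, partition count, and broker address are assumptions; `segment.bytes` is the per-topic counterpart of the broker-wide `log.segment.bytes` setting, whose default is the 1GB mentioned above.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithSegmentSize {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("orders", 3, (short) 1)
                    // Per-topic override: roll to a new segment file every 1GB
                    // (1073741824 bytes, which is also the broker-wide default).
                    .configs(Map.of("segment.bytes", "1073741824"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```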