I'm a newbie for Kafka. When I read the documentation of Kafka, I saw that Kafka is performing well because of sequential disk access.
But how is that possible? In Java (or any other language), if I use file I/O, the OS handles it for me. I can't know whether the OS stores the files I write in contiguous sectors or scatters them across multiple sectors. So, in my opinion, Kafka cannot always claim that its disk access is sequential.
Am I right or not?
The most important thing to remember is that Kafka preserves the order of messages within a partition. For some use cases, preserving message ordering can be very important from a business point of view.
Kafka's delivery guarantees can be divided into three groups: "at most once", "at least once", and "exactly once".
- "At most once": messages can be lost, but they are never redelivered or duplicated.
- "At least once": messages are never lost, but they may be duplicated.
- "Exactly once": each message is delivered once and only once.
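As an illustration, here is a minimal sketch of producer configurations that roughly correspond to each guarantee. It assumes the standard Java client and a broker at localhost:9092; the class and method names are mine, not part of any Kafka API.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DeliveryConfigs {

    // "At most once": fire and forget; no acks, no retries, so a failed send is lost.
    static Properties atMostOnce() {
        Properties p = baseProps();
        p.put(ProducerConfig.ACKS_CONFIG, "0");
        p.put(ProducerConfig.RETRIES_CONFIG, "0");
        return p;
    }

    // "At least once": wait for all in-sync replicas and retry on failure;
    // a retry after a lost ack can write the same message twice.
    static Properties atLeastOnce() {
        Properties p = baseProps();
        p.put(ProducerConfig.ACKS_CONFIG, "all");
        p.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        return p;
    }

    // "Exactly once" (within Kafka): the idempotent producer lets the broker
    // deduplicate retries; combine with transactions for cross-partition atomicity.
    static Properties exactlyOnce() {
        Properties p = baseProps();
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        p.put(ProducerConfig.ACKS_CONFIG, "all");
        return p;
    }

    static Properties baseProps() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer");
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer");
        return p;
    }
}
```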
In Kafka, order can only be guaranteed within a partition. This means that if a producer sends messages in a specific order, the broker will write them to the partition in that order, and all consumers will read them from that partition in the same order.
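A small sketch of how you would rely on this in practice: records that share a key are hashed to the same partition by the default partitioner, so their relative order is preserved. The topic name, key, and broker address here are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // All records share the key "order-42", so they land in the same
            // partition and consumers will see event-0, event-1, event-2 in order.
            for (int i = 0; i < 3; i++) {
                producer.send(new ProducerRecord<>("orders", "order-42", "event-" + i));
            }
        } // close() flushes any pending records
    }
}
```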
When it comes to exactly-once semantics between Kafka and external systems, the limitation is not necessarily in the messaging system itself, but in the need to coordinate the consumer's position with what has already been stored in the destination. For example, the destination might be an HDFS- or object-store-based data lake.
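One common way to achieve this coordination is to store the consumer's offset in the destination system atomically with the data itself, so that on restart the consumer resumes from exactly the position the destination has seen. Below is a rough sketch of that pattern; the `Database` class is hypothetical and stands in for any transactional store, and the topic, group, and broker address are assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetInDestination {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "lake-writer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // we track offsets ourselves
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition tp = new TopicPartition("orders", 0);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            // Resume from the offset recorded in the destination, not in Kafka.
            consumer.seek(tp, Database.loadOffset(tp));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // One transaction writes both the record and the next offset,
                    // so a crash can never leave them out of sync.
                    Database.writeAtomically(r.value(), tp, r.offset() + 1);
                }
            }
        }
    }

    // Hypothetical transactional store standing in for the data lake's metadata.
    static class Database {
        static long loadOffset(TopicPartition tp) { return 0L; }
        static void writeAtomically(String row, TopicPartition tp, long nextOffset) {
            /* write row and nextOffset in a single transaction */
        }
    }
}
```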
Kafka does not always access the disk sequentially, but it does several things that make sequential access much more likely.

All Kafka messages are stored in large segment files (1GB each by default). Because Kafka messages are not deleted when consumed (unlike in other message brokers), Kafka does not fragment the filesystem over time by continuously creating and deleting many variable-length files. Instead it creates a segment file and appends to it until it reaches 1GB (a configurable limit). Only when all messages in a segment have expired does Kafka delete the entire 1GB segment. This means these 1GB sections of disk are often laid out as contiguous blocks. It is a recommended best practice to keep the Kafka commit log files on a dedicated filesystem, so they do not get fragmented by other applications reading and writing variable-length files into the same filesystem.

More importantly, most reading and writing of these segment files is sequential and goes through the OS page cache, which reduces disk I/O even further by keeping the most frequently accessed pages in memory. This is also why it is recommended to tune the kernel's swappiness down to 1, to reduce the likelihood that these cached pages get swapped out of memory.
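For illustration, here is a small sketch that creates a topic with an explicit segment size using the Java AdminClient. The topic name, partition count, and broker address are assumptions; `segment.bytes` is the per-topic counterpart of the broker-wide `log.segment.bytes` setting, whose default is the 1GB mentioned above.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithSegmentSize {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("orders", 3, (short) 1)
                    // Per-topic override: roll to a new segment file every 1GB
                    // (1073741824 bytes, which is also the broker-wide default).
                    .configs(Map.of("segment.bytes", "1073741824"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```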