 

Need to understand kafka broker property "log.flush.interval.messages"

Tags:

apache-kafka

I want to understand the log.flush.interval.messages setting in the Kafka broker.

The number of messages written to a log partition before we force an fsync on the log

Does it mean that when it reaches the specified number of messages, it will write to another file on the disk? If so, then:

  1. When a consumer wants to read, it has to get the messages from disk, which is time-consuming. Is this correct?
  2. At the same time

    A message is only exposed to the consumers after it is flushed to disk from the segment file (http://notes.stephenholiday.com/Kafka.pdf)

    Then does the consumer always read from disk, since it can't read from the segment file?

  3. What is the difference between storing in a segment file and on a disk?

Goutam Chowdhury asked Nov 28 '15


2 Answers

The first thing I want to warn you about is that the Kafka paper is a little bit outdated regarding how all of this works, since at that time Kafka did not have replication. I suggest you read (if you haven't already) the Replication section of the Kafka documentation.

As the paper mentions, each arriving message is written to a segment file. But you have to remember that when you write to a file, the data is not transferred to the disk device immediately; it is buffered first. The way to force this write to happen is by calling the fsync system call (see man fsync), and this is where "log.flush.interval.messages" and "log.flush.interval.ms" come into play. With these settings you can tell Kafka exactly when to do this flush (after a certain number of messages or after a period of time). But please note that Kafka generally recommends that you not set these, and instead use replication for durability and allow the operating system's background flush capabilities, as that is more efficient (see Broker configs in the Kafka documentation).
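For illustration, a minimal server.properties sketch of these two settings; the values here are hypothetical examples, not recommendations (as noted, the default is to leave both unset and let the OS handle flushing):

    # server.properties (illustrative values only)
    # fsync after every 10000 messages appended to a log partition
    log.flush.interval.messages=10000
    # also fsync any message that has sat unflushed for 1000 ms
    log.flush.interval.ms=1000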

For the second part of your question: as mentioned in the Replication section of the Kafka documentation, only committed messages (a message is considered "committed" when all in-sync replicas for that partition have applied it to their log) are ever given out to the consumer. This is to avoid consumers potentially seeing a message that could be lost (because it was not fsynced to disk yet) if the leader fails.
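As a sketch of that recommended approach (durability via replication rather than per-message fsync), here is a minimal Java producer that waits for acknowledgement from all in-sync replicas; the broker address and topic name are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class AcksAllProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            // "all": the leader acknowledges only after every in-sync replica
            // has the record, i.e. the message is "committed"
            props.put("acks", "all");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                RecordMetadata meta = producer
                    .send(new ProducerRecord<>("my-topic", "key", "value"))
                    .get(); // block until the record is committed (or fails)
                System.out.println("committed at offset " + meta.offset());
            }
        }
    }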

Luciano Afranllie answered Dec 02 '22


@user1870400

By default, both log.flush.interval.ms and log.flush.interval.messages are set to their maximum values, which means Kafka leaves flushing the log to disk (e.g. fsync on Linux) entirely up to the file system. So even if you set acks to 'all', neither the follower replicas nor the leader itself ensures that the log fetched from the leader has been flushed to disk. If all the replicas crash before flushing, the log will be lost. The reason Kafka chooses such an 'unsafe' default is, just like the paper said:

    Kafka avoids explicitly caching messages in memory at the Kafka layer.
    Kafka relies on the underlying file system page cache.
    This has the main benefit of avoiding double buffering---messages are only cached in the page cache.
    This has the additional benefit of retaining a warm cache even when a broker process is restarted.

In order to make better use of the file system cache, Kafka sets both flush intervals to their maximum by default. If you want to avoid losing messages even when N brokers crash, set the topic-level config flush.messages or the broker-level config log.flush.interval.messages to 1, as in the sketch below.
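For example, a minimal Java AdminClient sketch that applies the topic-level override; the broker address and topic name are placeholders, and note that fsyncing every message trades throughput for durability:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class SetFlushMessages {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                // fsync after every single message: safest, but slowest
                AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("flush.messages", "1"), AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(
                        Collections.singletonMap(topic, Collections.singletonList(op)))
                     .all().get();
            }
        }
    }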

djzhu answered Dec 02 '22