I'm benchmarking my Kafka cluster, version 1.0.0-cp1.
For the part of my benchmark that focuses on the maximum throughput possible with an ordering guarantee and no data loss (a topic with only one partition), do I need to set the max.in.flight.requests.per.connection property to 1?
I've read this article, and I understand that I only have to set max.in.flight to 1 if I enable the retry feature on my producer with the retries property.
Another way to ask my question: is a single partition plus retries=0 (producer props) sufficient to guarantee ordering in Kafka?
I need to know because increasing max.in.flight drastically increases throughput.
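For reference, here is a minimal sketch of the producer setup I'm describing (the broker address and topic name are just placeholders): a single-partition topic, retries=0, and max.in.flight left at its default.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class BenchProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.RETRIES_CONFIG, 0);                          // no retries
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);   // default, kept high for throughput

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1_000_000; i++) {
                // "bench-topic" has a single partition, so every record lands on partition 0
                producer.send(new ProducerRecord<>("bench-topic", "msg-" + i));
            }
        }
    }
}
```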
Yes, you must set the max.in.flight.requests.per.connection property to 1.
The article you read initially contained a mistake (since corrected): the author wrote max.in.flights.requests.per.session, which doesn't exist in the Kafka documentation.
This erratum probably comes from the book "Kafka: The Definitive Guide" (1st edition), where on page 52 you can read: "...so if guaranteeing order is critical, we recommend setting in.flight.requests.per.session=1 to make sure that while a batch of messages is retrying, additional messages will not be sent..."
Your use case is slightly unclear. You mention ordering and no data loss but don't specify whether you can tolerate duplicate messages, so it's unclear whether you want At Least Once (QoS 1) or Exactly Once semantics.
Either way, as you're using 1.0.0 and only a single partition, you should have a look at the Idempotent Producer instead of tweaking the producer configs. It lets you properly and efficiently guarantee ordering and no data loss.
From the documentation:
Idempotent delivery ensures that messages are delivered exactly once to a particular topic partition during the lifetime of a single producer.
The early Idempotent Producer forced max.in.flight.requests.per.connection to 1 (for the same reasons you mentioned), but in the latest releases it can be used with max.in.flight.requests.per.connection set as high as 5 and still keep its guarantees.
Using the Idempotent Producer, you not only get stronger delivery semantics (Exactly Once instead of At Least Once), it might even perform better!
I recommend you check the delivery semantics in the docs: http://kafka.apache.org/documentation/#semantics
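A minimal sketch of what enabling the Idempotent Producer looks like (the broker address and topic name are placeholders, not from your question):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Enabling idempotence makes the client enforce acks=all, retries > 0 and
        // (in 1.0.0) max.in.flight.requests.per.connection <= 5, so ordering and
        // no data loss are handled for you on a single-partition topic.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("bench-topic", "some value"));
        }
    }
}
```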
Back to your question:
Yes, without the Idempotent (or Transactional) Producer, if you want to avoid data loss (QoS 1) and preserve ordering, you have to set max.in.flight.requests.per.connection to 1, allow retries, and use acks=all. As you saw, this comes at a significant performance cost.
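For completeness, here is a minimal sketch of those settings (again, the broker address is a placeholder):

```java
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class StrictOrderingConfig {
    static Properties strictOrderingProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                          // wait for all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);           // retry transient failures instead of dropping
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);    // one unacknowledged request at a time
        return props;
    }
}
```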
IMO, it's also invaluable to know about this issue, which makes things far more interesting and slightly more complicated.
When you enable enable.idempotence=true, every time you send a message to the broker you also send a sequence number, starting from zero. The broker stores that sequence number on its side as well. When you make the next request to the broker, let's say with sequence_id=3, the broker can look at the sequence number it currently has stored and decide: if the incoming number is exactly one higher, it accepts and stores the batch; otherwise it rejects it.
And now max.in.flight.requests.per.connection: a producer can have up to this many concurrent requests in flight without waiting for an answer from the broker. When we reach 3 (let's say max.in.flight.requests.per.connection=3), we have to wait for the broker to answer the previous requests (and in the meantime we can't send any more batches, even if they are ready).
Now, for the sake of the example, let's say the broker replies: "1 was OK, I stored it", "2 has failed", and now the important part: because 2 failed, the only possible answer you can get for 3 is "out of order", which means it was not stored. The client now knows it needs to reprocess 2 and 3, so it builds a list and resends them, in that exact order, provided retries are enabled.
This explanation is probably oversimplified, but it's my basic understanding after reading the source code a bit.
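To make the idea concrete, here is a rough conceptual sketch of the broker-side check described above. This is not the actual broker code; the class, field, and method names are invented purely to illustrate per-producer sequence numbers:

```java
import java.util.HashMap;
import java.util.Map;

class SequenceChecker {
    // last sequence number stored per producer id (per partition in reality)
    private final Map<Long, Integer> lastStoredSeq = new HashMap<>();

    enum Result { STORED, DUPLICATE, OUT_OF_ORDER }

    Result tryAppend(long producerId, int seq) {
        int last = lastStoredSeq.getOrDefault(producerId, -1);
        if (seq == last + 1) {
            lastStoredSeq.put(producerId, seq);   // exactly the next expected number: store it
            return Result.STORED;
        } else if (seq <= last) {
            return Result.DUPLICATE;              // already stored: safe to ack without re-appending
        } else {
            return Result.OUT_OF_ORDER;           // a gap (e.g. 2 failed, 3 arrives): reject, client must resend
        }
    }
}
```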