Can a Kafka consumer (0.8.2.2) read messages in batch?

As per my understanding, a Kafka consumer reads messages from an assigned partition sequentially.

We are planning to have multiple Kafka consumers (Java) with the same group id. If each one reads sequentially from its assigned partition, how can we achieve high throughput? For example, the producer publishes around 40 messages per second while a consumer processes 1 message per second; we can add more consumers, but we can't have 40, right? Correct me if I'm wrong.

And in our case, the consumer has to commit the offset only after a message is processed successfully; otherwise the message will be reprocessed. Is there any better solution?

asked Feb 25 '16 by shiv455

People also ask

Can Kafka do batch processing?

Yes. Batch processing can be implemented easily with Apache Kafka, letting you keep Kafka's usual advantages while operating efficiently.

Can a Kafka consumer read from multiple topics?

Yes, Kafka's design allows consumers from one consumer group to consume messages from multiple topics.

Can one consumer read from multiple partitions?

When the number of consumers is lower than the number of partitions, some consumers read messages from more than one partition. In your scenario, a single consumer reads from all of your partitions; this is known as an exclusive consumer, and it happens when a consumer group contains only one consumer.

What is batch size in Kafka?

batch.size is the maximum number of bytes that will be included in a batch; the default is 16 KB. Increasing the batch size to 32 KB or 64 KB can help improve compression, throughput, and the efficiency of requests. Any message bigger than the batch size will not be batched.
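
For illustration, here is a minimal producer-side sketch of raising batch.size; the broker address and serializers are placeholders, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
props.put("batch.size", "32768"); // 32 KB instead of the 16 KB default

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
```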


1 Answer

Based on your clarification of the question:

A Kafka consumer can read multiple messages at a time. But a Kafka consumer doesn't really read messages; it's more correct to say that a consumer reads a certain number of bytes, and the size of the individual messages then determines how many messages are read. Reading through the Kafka Consumer Configs, you're not allowed to specify how many messages to fetch; instead you specify a min/max data size that a consumer can fetch, and however many messages fit inside that range is how many you will get. You will always get messages sequentially, as you have pointed out.

Related Consumer Configs (for 0.9.0.0 and greater)

  • fetch.min.bytes
  • max.partition.fetch.bytes
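
As a minimal sketch of setting these on a 0.9+ Java consumer (the broker address, group id, and byte values below are placeholders, not recommendations):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
props.put("group.id", "my-group");                // placeholder group id
props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

// Don't answer a fetch until at least this many bytes are available.
props.put("fetch.min.bytes", "1024");
// Return at most this many bytes per partition per fetch.
props.put("max.partition.fetch.bytes", "1048576");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
```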

UPDATE

Using your example in the comments, "my understanding is if i specify in config to read 10 bytes and if each message is 2 bytes the consumer reads 5 messages at a time." That is true. Your next statement, "that means the offsets of these 5 messages were random with in partition," is false. Reading sequentially doesn't mean one by one; it just means the messages remain ordered. You are able to batch items and have them remain sequential/ordered. Take the following examples.

Suppose a Kafka log contains 10 messages (each 2 bytes) at offsets [0,1,2,3,4,5,6,7,8,9].

If you read 10 bytes, you'll get a batch containing the messages at offsets [0,1,2,3,4].

If you read 6 bytes, you'll get a batch containing the messages at offsets [0,1,2].

If you read 6 bytes, then another 6 bytes, you'll get two batches containing the messages [0,1,2] and [3,4,5].

If you read 8 bytes, then 4 bytes, you'll get two batches containing the messages [0,1,2,3] and [4,5].
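
To make "batched but still ordered" concrete, here is a small sketch continuing from the consumer configured above; the topic name is a placeholder, and on a single partition the printed offsets will only ever increase:

```java
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
while (true) {
    // One poll returns one batch; within a partition, offsets arrive in order.
    ConsumerRecords<String, String> batch = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : batch) {
        System.out.printf("partition=%d offset=%d%n",
                record.partition(), record.offset());
    }
}
```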

Update: Clarifying Committing

I'm not 100% sure how committing works; I've mainly worked with Kafka from a Storm environment, where the provided KafkaSpout automatically commits Kafka messages.

But looking through the 0.9.0.1 Consumer APIs (which I recommend you do too), there seem to be three methods in particular that are relevant to this discussion.

  • poll(long timeout)
  • commitSync()
  • commitSync(java.util.Map offsets)

The poll method retrieves messages; it could return only 1, it could return 20. For your example, let's say 3 messages were returned: [0,1,2]. You now have those three messages, and it's up to you to determine how to process them. You could process them 0 => 1 => 2, 1 => 0 => 2, or 2 => 0 => 1; it just depends. However you process them, afterwards you'll want to commit, which tells the Kafka server you're done with those messages.

Calling commitSync() commits everything returned by the last poll; in this case it would commit offsets [0,1,2].
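
Putting poll and commitSync() together, a minimal at-least-once loop might look like the sketch below, reusing the consumer configured earlier; process() is a hypothetical stand-in for your own handling logic:

```java
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : records) {
        process(record); // hypothetical stand-in for your processing logic
    }
    // Commits the offsets returned by the last poll, e.g. [0,1,2] above.
    consumer.commitSync();
}
```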

On the other hand, if you choose to use commitSync(java.util.Map offsets), you can manually specify which offsets to commit. If you're processing them in order, you can process offset 0 then commit it, process offset 1 then commit it, finally process offset 2 and commit.
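
A sketch of that per-offset variant, again with a hypothetical process() step; TopicPartition and OffsetAndMetadata are the actual argument types, and the committed value is the next offset to read, i.e. the processed offset plus one:

```java
import java.util.Collections;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

for (ConsumerRecord<String, String> record : records) {
    process(record); // hypothetical processing step
    TopicPartition tp = new TopicPartition(record.topic(), record.partition());
    // Commit the next offset to read: processed offset + 1.
    consumer.commitSync(
            Collections.singletonMap(tp, new OffsetAndMetadata(record.offset() + 1)));
}
```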

All in all, Kafka gives you the freedom to process messages however you desire: you can process them sequentially, or in an entirely different order of your choosing.

answered Oct 11 '22 by Morgan Kenyon