
Can a Kafka consumer filter messages before polling all of them from a topic?

From what I've read, consumers can only read the whole topic; there is no way to evaluate filters on the brokers.

This implies that we have to consume/receive all messages from a topic and filter them on the client side.

That's too much overhead. I was wondering if we can receive only specific types of messages, based on something already passed to the brokers, such as the message keys.

Looking at the method Consumer.poll(timeout), it seems there is nothing extra we can do.

asked Jun 26 '18 by mikej1688


3 Answers

No, with the Consumer API you cannot receive only a subset of messages from a topic. The consumer fetches all messages in order.

If you don't want to filter messages in the Consumer, you could use a Kafka Streams job. For example, the Streams job would read from your topic and push only the messages the consumer is interested in to another topic. The consumer can then subscribe to this new topic.
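A minimal sketch of the filtering step such a Streams job would perform, assuming messages are keyed by type. The topic names ("fruits", "apples") and the key-based predicate are illustrative assumptions; the predicate is written as a plain `BiPredicate` and applied to simple key/value pairs so the logic runs without a broker, and the comment shows where it would plug into a real `StreamsBuilder` topology.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.stream.Collectors;

public class StreamsFilterSketch {
    // The predicate a Kafka Streams topology could apply, e.g.:
    //   builder.stream("fruits").filter(KEEP_APPLES::test).to("apples");
    // ("fruits" and "apples" are hypothetical topic names)
    static final BiPredicate<String, String> KEEP_APPLES =
            (key, value) -> "apple".equals(key);

    // The same filtering applied to plain key/value pairs, so the
    // logic can be exercised without a Kafka cluster.
    static List<Map.Entry<String, String>> filter(List<Map.Entry<String, String>> records) {
        return records.stream()
                .filter(r -> KEEP_APPLES.test(r.getKey(), r.getValue()))
                .collect(Collectors.toList());
    }
}
```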

answered Nov 19 '22 by Mickael Maison


Each Kafka topic should contain messages that are logically similar, just to stay on topic. Still, it can happen that you have a topic, let's say fruits, whose messages carry different attributes of a fruit (maybe in JSON format). Producers push messages for different fruits, but you want one of your consumer groups to process only apples. Ideally you would have gone with one topic per fruit, but let's assume that is a fruitless endeavor for some reason (maybe too many topics).

In that case, have the producer put the fruit name in the message key, and override the default partitioning scheme: pass a custom partitioner class through the producer's partitioner.class property that ignores the key and partitions randomly. This is required because, by default, all messages with the same key go to the same partition, and that might cause partition imbalance.

The idea behind this is that if your Kafka message value is a complex object (JSON, an Avro record, etc.), it can be quicker to filter records based on the key than to parse the whole value and extract the desired field. I don't have any data to support the performance benefit of this approach, though; it's only an intuition.
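That intuition can be sketched as a consumer-side check that inspects only the small key bytes and skips deserializing the large value when the key does not match. The byte-level comparison below assumes the key was written with the UTF-8 StringSerializer; that assumption, and the "apple" key, are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeyFirstFilter {
    // Compare the raw key bytes before touching the value, so the
    // expensive JSON/Avro parse of the value is skipped entirely for
    // records the consumer does not care about.
    static boolean shouldProcess(byte[] keyBytes, String wantedKey) {
        if (keyBytes == null) return false;  // unkeyed record: skip
        return Arrays.equals(keyBytes, wantedKey.getBytes(StandardCharsets.UTF_8));
    }
}
```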

answered Nov 20 '22 by Bitswazsky


Once records have been pushed into the Kafka cluster, there is not much you can do. Whatever filtering you want, you will always have to bring the data to the client side.

Unfortunately, the only option is to push that logic to the producers: they can write the data into multiple topics based on whatever logic you define.
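A sketch of that producer-side routing: the destination topic is chosen from the message itself before sending. The topic names and the fruit-based rule are hypothetical; a real producer would then call producer.send(new ProducerRecord<>(topicFor(key), key, value)).

```java
public class TopicRouter {
    // Pick the destination topic from the message key. Records are
    // split across topics at produce time, so each consumer group
    // subscribes only to the topic it actually needs.
    static String topicFor(String fruit) {
        return "apple".equals(fruit) ? "fruits.apples" : "fruits.other";
    }
}
```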

answered Nov 20 '22 by dbustosp