
How to filter messages before passing them on to consumers?

I'm building a lead and event management system on Kafka. The problem is that we receive many fake leads (advertisements), and we have many consumers in our system. Is there any way to filter out the advertisements before they reach the consumers? My current idea is to write everything to a first topic, read it with a dedicated filter consumer, and have that consumer either write each message to a second topic or drop it. I'm not sure whether this is efficient. Any ideas?

user1079877 asked Jun 18 '15 12:06



People also ask

Can a Kafka consumer filter messages before polling all of them from a topic?

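No. A Kafka consumer cannot ask the broker for only some of the messages in a topic; it receives every message in the partitions it reads and has to filter on its own side.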


Can Kafka filter events?

Kafka does not support broker-side filtering for consumers. A consumer that needs only a subset of the messages published to a topic has to read all of them and keep only what it needs. This is inefficient, because every message must be fetched, and usually deserialized, before that decision can be made.


3 Answers

You can use Kafka Streams (http://kafka.apache.org/documentation.html#streamsapi), which is available in Kafka 0.10 and later. I think it fits your use case exactly.

JongHyok Lee answered Oct 21 '22 11:10



Yes. In fact, I am mostly convinced that this is how you are supposed to handle a problem like yours. Kafka's job is to move data efficiently; the broker itself does nothing to clean your data. Have an intermediary consumer read everything, run its own tests to decide what passes the filter, and push the accepted messages to a different topic or partition (depending on your needs), so that the downstream consumers see only clean data.
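
To make the pattern concrete, here is a minimal sketch of the consume, filter, and republish loop in Python. It does not talk to a real broker: two in-memory queues stand in for the raw and clean topics, and the names `AD_PHRASES`, `is_advertisement`, and `run_filter`, as well as the keyword rule itself, are assumptions made up for illustration. In a real deployment the two marked lines would be replaced by the consumer and producer calls of whatever Kafka client you use.

```python
from collections import deque

# Hypothetical rule: treat a lead as an advertisement if its text
# contains any of these phrases. A real system would use its own checks.
AD_PHRASES = ("buy now", "free trial", "click here")


def is_advertisement(lead: dict) -> bool:
    """Return True when the lead's message looks like an advertisement."""
    text = lead.get("message", "").lower()
    return any(phrase in text for phrase in AD_PHRASES)


def run_filter(raw_topic: deque, clean_topic: deque) -> int:
    """Drain raw_topic, forward non-ad leads to clean_topic, return the drop count."""
    dropped = 0
    while raw_topic:
        lead = raw_topic.popleft()      # stands in for reading from the first topic
        if is_advertisement(lead):
            dropped += 1                # discard; could also go to a "rejected" topic
        else:
            clean_topic.append(lead)    # stands in for writing to the second topic
    return dropped


# In-memory stand-ins for the two topics.
raw = deque([
    {"id": 1, "message": "Please call me about pricing"},
    {"id": 2, "message": "BUY NOW and save 50%"},
    {"id": 3, "message": "Interested in a demo next week"},
])
clean = deque()
dropped = run_filter(raw, clean)
```

With this layout the downstream consumers subscribe only to the clean topic, so the filtering rule lives in one place. The trade-off is one extra write and one extra read for every accepted message.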

Jeff Gong answered Oct 21 '22 11:10



You can use Spark Streaming: https://spark.apache.org/docs/latest/streaming-kafka-integration.html.

Nikita Shamgunov answered Oct 21 '22 10:10
