I'm building a lead and event management system with Kafka. The problem is that we receive many fake leads (advertisements), and we have many consumers in our system. Is there any way to filter out the advertisements before they reach the consumers? My idea is to write everything to a first topic, have a filtering consumer read it, and write the clean records to a second topic. But I'm not sure whether that is efficient. Any ideas?
No, with the Kafka Consumer you cannot receive only a subset of the messages in a topic; a consumer always gets every record in the partitions it is assigned.
Kafka doesn't support server-side filtering for consumers. If a consumer needs to listen to a subset of the messages published to a Kafka topic, it has to read all of them and keep only what it needs. This is inefficient, because every message must be deserialized just to make that decision.
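To make the point concrete, here is a minimal sketch of what every consumer is forced to do client-side. The `is_advertisement` heuristic and the JSON message layout are assumptions for illustration; your real spam criteria will differ.

```python
import json

def is_advertisement(lead: dict) -> bool:
    """Placeholder spam heuristic -- the real criteria depend on your data (assumption)."""
    text = (lead.get("message") or "").lower()
    return any(kw in text for kw in ("buy now", "promo", "advertisement"))

def filter_batch(raw_records):
    """Deserialize every record and keep only genuine leads.

    This is what each consumer has to do itself, because Kafka brokers
    deliver all records in the partitions a consumer is assigned --
    including the ones it will immediately throw away.
    """
    kept = []
    for raw in raw_records:
        lead = json.loads(raw)          # every record must be deserialized...
        if not is_advertisement(lead):  # ...just to decide whether to drop it
            kept.append(lead)
    return kept
```

With many consumers, this deserialize-and-discard cost is paid once per consumer group, which is exactly why moving the filter in front of them is attractive.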
You can use Kafka Streams (http://kafka.apache.org/documentation.html#streamsapi), available since Kafka 0.10.x. I think it is exactly for your use case.
Yes -- in fact I am fairly convinced this is the intended way to handle a problem like yours. Kafka itself only provides efficient transmission of data; it can do nothing to clean that data for you. Consume everything with an intermediary consumer that runs its own checks, and publish the records that pass the filter to a different topic (or partition, depending on your needs) so downstream consumers only ever see clean data.
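A sketch of that intermediary service, using the kafka-python package. The broker address, the topic names `raw-leads` / `clean-leads`, and the `looks_like_ad` predicate are all assumptions; substitute your own.

```python
import json

def looks_like_ad(lead: dict) -> bool:
    # Placeholder predicate (assumption) -- plug in your real spam detection here.
    return "promo" in (lead.get("message") or "").lower()

def run_filter_service(bootstrap="localhost:9092",
                       raw_topic="raw-leads",
                       clean_topic="clean-leads"):
    """Read everything from raw_topic, republish non-ads to clean_topic.

    Requires the kafka-python package and a running broker (assumptions).
    """
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(
        raw_topic,
        bootstrap_servers=bootstrap,
        group_id="lead-filter",  # one group, so each record is filtered once
        value_deserializer=lambda v: json.loads(v),
    )
    producer = KafkaProducer(
        bootstrap_servers=bootstrap,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for record in consumer:
        if not looks_like_ad(record.value):
            producer.send(clean_topic, record.value)

if __name__ == "__main__":
    run_filter_service()
```

The nice property of this layout is that the deserialize-and-filter cost is paid once, in a single consumer group, instead of once per downstream consumer; all other consumers subscribe to the clean topic and trust its contents.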
You can use Spark Streaming: https://spark.apache.org/docs/latest/streaming-kafka-integration.html.