I'm working with Kafka and I've been asked to validate the messages that are sent to Kafka, but I don't like the solutions I've come up with, so I hope someone can advise me on this.
We have many producers outside our control, so they can send any message in any kind of format, and we could have as many as 80 million records that must be processed in under 2 hours. I've been asked to:
Validate the format (JSON, since it has to be compatible with MongoDB).
Validate some of the fields sent.
Rename some fields.
The last 2 requirements are to be done using parameters stored in MongoDB. All of this should be done assuming we are not the only ones writing consumers, so there should be a "simple" call to our service that performs this validation. Any ideas?
This is often done with a Kafka Streams job.
You have "raw" input topics where your producers send events. Then a Streams job reads from these topics and writes valid records into "clean" topics. In Streams you can do all sorts of processing to check records or enrich them if needed.
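A minimal sketch of such a job in Java, assuming String keys and values and hard-coded rules; in your case the required fields and rename map would be loaded from MongoDB instead. The topic names (`raw-events`, `clean-events`), bootstrap server, and field names (`userId`, `old_name`, `new_name`) are all placeholders:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class ValidationJob {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "message-validator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("raw-events");

        raw.mapValues(ValidationJob::validateAndRename)
           .filter((key, value) -> value != null)  // drop records that failed validation
           .to("clean-events");

        new KafkaStreams(builder.build(), props).start();
    }

    // Returns the cleaned JSON, or null if the record is invalid.
    // The field checks and rename rule below are hypothetical examples;
    // in practice they would come from the parameters stored in MongoDB.
    private static String validateAndRename(String value) {
        try {
            ObjectNode json = (ObjectNode) MAPPER.readTree(value);
            if (!json.hasNonNull("userId")) return null;  // example required field
            if (json.has("old_name")) {                   // example rename rule
                json.set("new_name", json.remove("old_name"));
            }
            return MAPPER.writeValueAsString(json);
        } catch (Exception e) {
            return null;  // not valid JSON (or not a JSON object)
        }
    }
}
```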
You probably also want to write bad records to a dead letter queue topic so you can investigate why they failed validation.
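One way to do that is to split the stream instead of filtering, so invalid records are routed to a DLQ topic rather than dropped. A sketch using the split()/Branched API (Kafka Streams 2.8+; older versions have a similar branch() method), again with placeholder topic names:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Branched;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Named;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ValidationWithDlq {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "message-validator-dlq");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("raw-events");

        // Valid records go to the clean topic; everything else goes
        // to a dead letter topic for later inspection.
        Map<String, KStream<String, String>> branches = raw
                .split(Named.as("validated-"))
                .branch((key, value) -> isValidJson(value), Branched.as("ok"))
                .defaultBranch(Branched.as("dlq"));

        branches.get("validated-ok").to("clean-events");
        branches.get("validated-dlq").to("dead-letter-events");

        new KafkaStreams(builder.build(), props).start();
    }

    // Example check: the payload must parse as a JSON object.
    private static boolean isValidJson(String value) {
        try {
            JsonNode node = MAPPER.readTree(value);
            return node != null && node.isObject();
        } catch (Exception e) {
            return false;
        }
    }
}
```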
Then your consumers can read from the clean topics to ensure they only see validated data.
This solution adds some latency, as records have to be processed before reaching consumers. You also want to run the Streams job close to the Kafka cluster, since depending on how much you want to validate, it may need to ingest large amounts of data.
Also see Handling bad messages using Kafka's Streams API, where some of these concepts are detailed.