There are large amounts of data being pushed into one of our Kafka topics. Is there a way to determine which producer this data is coming from?
Without SASL or Authorizer-level auditing, no, there is no easy way other than tracking down the suspicious connected client-id via JMX.
I would suggest you enforce a standard message format and spread the word to producer teams. For example, look at the CloudEvents spec, which includes a source field:
https://github.com/cloudevents/spec/blob/master/kafka-protocol-binding.md
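As a rough illustration of the "source field" idea, here is a minimal sketch of how a producer team could stamp every record with CloudEvents attributes as Kafka headers. It assumes the binary content mode of the Kafka binding, where attributes travel as headers prefixed with ce_. The function names, the source path, and the event type are hypothetical, and the (key, bytes) tuple shape is simply the header form that common Python Kafka clients accept, so adapt it to whatever client you use.

```python
import uuid


def cloudevent_headers(source, event_type, event_id=None):
    """Build Kafka record headers for a CloudEvent in binary content mode.

    `source` identifies the producing application or team; agreeing on its
    format is the part you would need to standardize across producers.
    """
    attrs = {
        "ce_specversion": "1.0",
        "ce_id": event_id or str(uuid.uuid4()),
        "ce_source": source,
        "ce_type": event_type,
    }
    # Kafka header values are bytes, so encode each attribute as UTF-8.
    return [(key, value.encode("utf-8")) for key, value in attrs.items()]


def source_of(headers):
    """Return the ce_source of a consumed record, or None if it is missing."""
    for key, value in headers:
        if key == "ce_source":
            return value.decode("utf-8")
    return None


# Hypothetical producer identity and event type, for illustration only.
example_headers = cloudevent_headers(
    "/teams/billing/invoice-service", "com.example.invoice.created"
)
```

On the consuming side, a small audit consumer could call source_of on each record's headers and count bytes per source; records for which it returns None come from producers that have not adopted the format yet.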
You can enable quotas for the clients/users, and then monitor which clients get throttled via two quota-related JMX MBeans - bandwidth and request rate:
Metric: Bandwidth quota metrics per (user, client-id), user or client-id
MBean: kafka.server:type={Produce|Fetch},user=([-.\w]+),client-id=([-.\w]+)
What it shows: Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. byte-rate indicates the data produce/consume rate of the client in bytes/sec. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.

Metric: Request quota metrics per (user, client-id), user or client-id
MBean: kafka.server:type=Request,user=([-.\w]+),client-id=([-.\w]+)
What it shows: Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. request-time indicates the percentage of time spent in broker network and I/O threads to process requests from client group. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.
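Once those MBeans are being scraped, finding the noisy producer is mostly a matter of parsing the object names and sorting by byte-rate. The sketch below assumes you already have the values exported into a dict keyed by MBean object name (for example from a JMX exporter); how you collect them is out of scope here. The function names and the sample numbers are made up for illustration.

```python
def parse_quota_mbean(object_name):
    """Split a quota MBean name into (type, user, client_id).

    user or client_id is None when that key is absent from the name, which
    is the case for per-client-id and per-user quotas respectively.
    """
    domain, _, props = object_name.partition(":")
    if domain != "kafka.server":
        return None
    tags = dict(p.split("=", 1) for p in props.split(",") if "=" in p)
    if tags.get("type") not in ("Produce", "Fetch", "Request"):
        return None
    return tags["type"], tags.get("user"), tags.get("client-id")


def top_producers(samples, limit=5):
    """Rank Produce quota entries by byte-rate, highest first."""
    rows = []
    for name, attrs in samples.items():
        parsed = parse_quota_mbean(name)
        if parsed is None or parsed[0] != "Produce":
            continue
        _, user, client_id = parsed
        rows.append(
            (attrs.get("byte-rate", 0.0), attrs.get("throttle-time", 0.0), user, client_id)
        )
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


# Made-up sample values, not real broker output.
SAMPLES = {
    "kafka.server:type=Produce,client-id=bulk-loader": {"byte-rate": 9.5e7, "throttle-time": 420.0},
    "kafka.server:type=Produce,user=alice,client-id=orders-api": {"byte-rate": 1.2e5, "throttle-time": 0.0},
    "kafka.server:type=Fetch,client-id=reporting": {"byte-rate": 3.0e6, "throttle-time": 0.0},
}

for byte_rate, throttle_ms, user, client_id in top_producers(SAMPLES):
    print(f"{client_id} (user={user}): {byte_rate:.0f} B/s, throttled {throttle_ms:.0f} ms")
```

A client-id that shows both a high byte-rate and a non-zero throttle-time is the first candidate to chase down with the owning team.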