Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Kafka Connect: How can I send protobuf data from Kafka topics to HDFS using hdfs sink connector?

Tags:

I have a producer that's producing protobuf messages to a topic. I have a consumer application which deserializes the protobuf messages. But hdfs sink connector picks up messages from the Kafka topics directly. What would the key and value converter in etc/schema-registry/connect-avro-standalone.properties be set to? What's the best way to do this? Thanks in advance!

like image 509
NoName Avatar asked Nov 23 '16 18:11

NoName


1 Answers

Kafka Connect is designed to separate the concern of serialization format in Kafka from individual connectors with the concept of converters. As you seem to have found, you'll need to adjust the key.converter and value.converter classes to implementations that support protobufs. These classes are commonly implemented as a normal Kafka Deserializer followed by a step which performs a conversion from serialization-specific runtime formats (e.g. Message in protobufs) to Kafka Connect's runtime API (which doesn't have any associated serialization format -- it's just a set of Java types and a class to define Schemas).

I'm not aware of an existing implementation. The main challenge in implementing this is that protobufs is self-describing (i.e. you can deserialize it without access to the original schema), but since its fields are simply integer IDs, you probably wouldn't get useful schema information without either a) requiring that the specific schema is available to the converter, e.g. via config (which makes migrating schemas more complicated) or b) a schema registry service + wrapper format for your data that allows you to look up the schema dynamically.

like image 79
Ewen Cheslack-Postava Avatar answered Sep 23 '22 16:09

Ewen Cheslack-Postava