I'm working with Confluent Platform, the platform offered by the creators of Apache Kafka, and I have a question:
In the documentation of the Schema Registry API Reference, they mention the abstraction of a "Subject". You register a schema under a "subject" which is of the form topicName-key, or topicName-value, yet there is no explanation as to why you need (as it implies) a separate schema for the key and value of messages on a given topic. Nor is there any direct statement to the effect that registration with a "subject" necessarily associates the schema with that topic, other than mnemonically.
Further confusing matters, the subsequent examples ("get schema version for subject" and "register new schema under subject") on that page do not use that format for the subject name, and instead use just a topic name for the "subject" value. If anyone has any insight into a) why there are these two "subjects" per topic, and b) what the proper usage is, it would be greatly appreciated.
A subject is a name under which a schema is registered. In Schema Registry, the subject a schema is registered under depends on the subject name strategy in use. When schemas evolve, they are still associated with the same subject but get a new schema ID and version.
Schemas, subjects, and topics: a schema defines the structure of the data format. The Kafka topic name can be independent of the schema name. Schema Registry defines a scope in which schemas can evolve, and that scope is the subject.
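To make the "same subject, new schema ID and version" idea concrete, here is a toy in-memory model. It is only a sketch and not how Schema Registry is actually implemented: the ToyRegistry class, the "logs-value" subject, and the extra "level" field are all made up for illustration, and the rule that identical schema text reuses its ID is an assumption of this toy.

```python
from collections import defaultdict


class ToyRegistry:
    """Toy in-memory model of subjects; NOT the real Schema Registry."""

    def __init__(self):
        self._ids = {}                      # schema text -> global schema ID
        self._subjects = defaultdict(list)  # subject -> schema IDs (list index + 1 = version)

    def register(self, subject, schema_text):
        # Assumption of this toy: identical schema text reuses its existing ID.
        schema_id = self._ids.setdefault(schema_text, len(self._ids) + 1)
        versions = self._subjects[subject]
        if schema_id not in versions:
            versions.append(schema_id)
        return schema_id, versions.index(schema_id) + 1


registry = ToyRegistry()

# Two revisions of a hypothetical LogEntry schema; "level" is an invented field.
v1 = '{"type":"record","name":"LogEntry","fields":[{"name":"line","type":"string"}]}'
v2 = ('{"type":"record","name":"LogEntry","fields":[{"name":"line","type":"string"},'
      '{"name":"level","type":"string","default":"INFO"}]}')

first = registry.register("logs-value", v1)   # (schema ID, version)
second = registry.register("logs-value", v2)  # same subject, new ID and new version
```

Both calls use the same subject, so the second schema becomes version 2 of "logs-value" rather than starting a new history.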
Confluent Schema Registry is actually a bit inconsistent with subject names :)
Indeed, the KafkaAvroSerializer (used with the new producer introduced in Kafka 0.8.2) uses the topic-key|value pattern for subjects, whereas KafkaAvroEncoder (for the old producer) uses the schema.getName()-value pattern.
The reason why one would have 2 different subjects per topic (one for key, one for value) is pretty simple:
say I have an Avro schema representing a log entry, and each log entry has source information attached to it:
{
  "type": "record",
  "name": "LogEntry",
  "fields": [
    {
      "name": "line",
      "type": "string"
    },
    {
      "name": "source",
      "type": {
        "type": "record",
        "name": "SourceInfo",
        "fields": [
          {
            "name": "host",
            "type": "string"
          },
          {
            "name": "...",
            "type": "string"
          }
        ]
      }
    }
  ]
}
A common use case is that I want to partition entries by source, so I would like two subjects associated with the topic (a subject is basically a series of revisions of an Avro schema): one for the key (SourceInfo) and one for the value (LogEntry).
Having these two subjects would allow partitioning and storing the data as long as I have a schema registry running and my producers/consumers can talk to it. Any modifications to these schemas would be reflected in the schema registry and as long as they satisfy compatibility settings everything should just serialize/deserialize without you having to care about this.
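As a rough sketch of what registering those two subjects could look like over the REST API, the snippet below builds (but does not send) the two registration requests. The registry address, the topic name "logs", and the helper build_register_request are assumptions for illustration, and SourceInfo is trimmed to the one field that is spelled out above.

```python
import json
import urllib.request

REGISTRY_URL = "http://localhost:8081"  # assumption: a locally running registry


def build_register_request(subject, schema):
    """Build (but do not send) a POST /subjects/<subject>/versions request."""
    # The API expects the Avro schema as a JSON string inside a JSON object.
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{REGISTRY_URL}/subjects/{subject}/versions",
        data=body,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )


# Key schema: SourceInfo, trimmed to the "host" field for this sketch.
source_info = {
    "type": "record",
    "name": "SourceInfo",
    "fields": [{"name": "host", "type": "string"}],
}

# Value schema: LogEntry, embedding SourceInfo as in the example above.
log_entry = {
    "type": "record",
    "name": "LogEntry",
    "fields": [
        {"name": "line", "type": "string"},
        {"name": "source", "type": source_info},
    ],
}

key_request = build_register_request("logs-key", source_info)
value_request = build_register_request("logs-value", log_entry)
```

Passing either request to urllib.request.urlopen would actually send it, assuming a registry is listening at that address.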
Note: any further information is just my personal thoughts and maybe I just don't yet fully understand how this is supposed to work so I might be wrong.
I actually prefer the way KafkaAvroEncoder is implemented over KafkaAvroSerializer. KafkaAvroEncoder does not force you to use ONE schema per topic key/value, whereas KafkaAvroSerializer does. This might be an issue when you plan to produce data for multiple Avro schemas into one topic. In this case KafkaAvroSerializer would try to update the topic-key and topic-value subjects, and registration would almost certainly fail because compatibility is violated (if you have multiple Avro schemas, they are almost always different and incompatible with each other).
On the other hand, KafkaAvroEncoder cares only about schema names, so you can safely produce data for multiple Avro schemas into one topic and everything should work just fine (you will have as many subjects as schemas).
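The difference is easy to see by computing the subject names each pattern would produce. This is just a sketch of the naming rules as I described them above, not the actual serializer code; the topic "events" and the second record type "Metric" are hypothetical.

```python
def serializer_subject(topic, is_key):
    """topic-key / topic-value pattern, as described for KafkaAvroSerializer."""
    return f"{topic}-{'key' if is_key else 'value'}"


def encoder_subject(schema_name):
    """schema.getName()-value pattern, as described for KafkaAvroEncoder."""
    return f"{schema_name}-value"


topic = "events"                       # hypothetical topic
record_names = ["LogEntry", "Metric"]  # "Metric" is a hypothetical second schema

# Topic-based naming: every value schema lands in the same subject,
# so each one must be compatible with the previous one.
by_topic = {serializer_subject(topic, is_key=False) for _ in record_names}

# Name-based naming: one subject per schema, each evolving independently.
by_name = {encoder_subject(name) for name in record_names}
```

With topic-based naming both record types compete for the single "events-value" subject, while name-based naming gives each its own subject.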
This inconsistency is still unclear to me and I hope Confluent guys can explain this if they see this question/answer.
Hope that helps you