Scenario: a client serializes a POJO using Avro's ReflectDatumWriter and writes a GenericRecord to a file. The schema obtained through reflection looks something like this (note the field ordering A, B, D, C):
{
"namespace": "storage.management.example.schema",
"type": "record",
"doc": "Example schema for testing",
"name": "Event",
"fields": [
....
....
{ "name": "A", "type": "string" },
{ "name": "B", "type": "string" },
{ "name": "D", "type": "string" },
{ "name": "C", "type": "string" },
....
....
]
}
An agent reads the file and uses a default schema (note the ordering A, B, C, D) to deserialize a subset of the record (the client is guaranteed to have these fields):
{
"namespace": "storage.management.example.schema",
"type": "record",
"doc": "Example schema for testing",
"name": "Event",
"fields": [
{ "name": "A", "type": "string" },
{ "name": "B", "type": "string" },
{ "name": "C", "type": "string" },
{ "name": "D", "type": "string" }
]
}
The problem: deserialization with the above subset schema results in the following exception:
Caused by: java.io.IOException: Invalid int encoding
at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:145)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:259)
at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:430)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
However, deserialization succeeds if the subset schema also specifies the fields in the order A, B, D, C (the same as the client schema).
Is this behavior expected? I thought Avro depended only on field names to build the record, not on their ordering.
Is there a fix for this? Different clients may have different orders, and I have no way to enforce an ordering because the schema is generated through reflection.
This is not necessarily expected behavior. You might be making the same mistake I made when I began using Avro.
Avro can work with different versions of a schema (e.g., write with one but read into another), but one thing that is very easy to miss (I certainly did) is that you must have the exact schema that wrote the message when you read it.
The documentation and other material about Avro, at least at the surface level, doesn't make that very clear. It usually emphasizes that Avro is "backwards compatible." To be fair, it is in a sense, but most people take that phrase to mean you can read old messages using only a new schema. What it actually means is that you can read old messages using a new schema together with the old messages' schema.
As an example, see this pseudocode:
Schema myUnsortedSchema has C B A order
Schema myAlphabeticalSchema has A B C order
Writer writer uses myUnsortedSchema
Reader badReader uses myAlphabeticalSchema only
writer writes message
badReader reads message
Error! I am not sure exactly what the error message will say, but the problem is that badReader not only tries to read into myAlphabeticalSchema but also reads the message as if it had been written with myAlphabeticalSchema. The solution is to give the reader both schemas: the one that wrote the message and the one to read into (how you do this depends on the language).
Reader goodReader reads messages written with myUnsortedSchema into myAlphabeticalSchema
goodReader reads message
No error! This is the correct usage.
If you are using an approach like goodReader, then this behavior is unexpected; if you are using an approach like badReader, then the behavior is expected.
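To make the badReader/goodReader difference concrete, here is a minimal sketch in plain Java. It is a toy model, not Avro's encoder: the class FieldOrderDemo, its methods, and the one-byte length prefix are all assumptions made for illustration. What it shares with Avro's binary encoding is that field values are written back to back in the writer's order and field names are never written, so the bytes can only be interpreted correctly with the writer's field list.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a record of strings (hypothetical, NOT Avro's real encoder):
// values are written back to back in the writer's field order, each as a
// one-byte length followed by UTF-8 bytes. Field names are never written.
class FieldOrderDemo {

    // Serialize the values in the order given by the writer's field list.
    static byte[] write(List<String> writerFields, Map<String, String> values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String field : writerFields) {
            byte[] utf8 = values.get(field).getBytes(StandardCharsets.UTF_8);
            out.write(utf8.length);           // assumes each value is shorter than 128 bytes
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    // Decode the bytes using the writer's field order, then pick out the
    // reader's fields by name. Passing the reader's own field list as
    // writerFields reproduces the badReader mistake.
    static Map<String, String> read(byte[] data, List<String> writerFields,
                                    List<String> readerFields) {
        ByteBuffer in = ByteBuffer.wrap(data);
        Map<String, String> decoded = new LinkedHashMap<>();
        for (String field : writerFields) {
            byte[] utf8 = new byte[in.get()];
            in.get(utf8);
            decoded.put(field, new String(utf8, StandardCharsets.UTF_8));
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String field : readerFields) {
            result.put(field, decoded.get(field));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> writerOrder = List.of("A", "B", "D", "C");
        List<String> readerOrder = List.of("A", "B", "C", "D");
        byte[] data = write(writerOrder, Map.of("A", "a", "B", "b", "C", "c", "D", "d"));

        // goodReader: knows the writer's order, so names and values line up.
        System.out.println(read(data, writerOrder, readerOrder));
        // badReader: assumes the data was written in its own order, so C and D are swapped.
        System.out.println(read(data, readerOrder, readerOrder));
    }
}
```

In Avro's Java API the goodReader pattern corresponds to constructing the reader with both schemas, as in new GenericDatumReader<GenericRecord>(writerSchema, readerSchema). If the data is in an Avro container file written with DataFileWriter, the writer's schema is stored in the file header, and DataFileReader supplies it to the datum reader for you. In this toy model the mismatch only swaps two values; with real data that has extra or differently typed fields, the decoder can instead fail partway through, as in the exception from the question.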
Some services, like Schema Registry, help with this by prepending metadata to the message bytes that identifies which schema wrote the message (and stripping it off before reading, of course). That is outside the scope of the question, but it can help solve problems like this.
Per the spec, the ordering of fields may be different: fields are matched by name (https://avro.apache.org/docs/1.8.1/spec.html). Note that your first schema also contains other fields that you haven't shown.