
Avro schemas not compatible if field order changes

Tags: java, schema, avro

Scenario: a client serializes a POJO using Avro's ReflectDatumWriter and writes the GenericRecord to a file. The schema obtained through reflection looks something like this (note the field order A, B, D, C):

{
"namespace": "storage.management.example.schema",

"type": "record",
"doc": "Example schema for testing",
"name": "Event",
"fields": [
     ....
     ....
    { "name": "A", "type":  "string"  },
    { "name": "B", "type":  "string"  },
    { "name": "D", "type": "string" },
    { "name": "C", "type":  "string"  },
     ....
     ....
]
} 

An agent reads the file and uses a default schema (note the field order A, B, C, D) to deserialize a subset of the record. The client is guaranteed to have these fields.

{
"namespace": "storage.management.example.schema",
"type": "record",
"doc": "Example schema for testing",
"name": "Event",
"fields": [
    { "name": "A", "type":  "string"  },
    { "name": "B", "type":  "string"  },
    { "name": "C", "type": "string" },
    { "name": "D", "type":  "string"  }
]
}

The problem: deserialization with the above subset schema results in the following exception:

Caused by: java.io.IOException: Invalid int encoding
at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:145)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:259)
at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:430)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)

However, deserialization succeeds if the subset schema also specifies the fields in the order A, B, D, C (the same as the client schema).

Is this behavior expected? I thought Avro only depends on the field name to build the record, not on the ordering.

Is there a fix for this? Different clients may have different field orders, and I have no way to enforce an ordering because the schema is generated through reflection.

Abhishek Kannan asked Aug 24 '17 04:08


2 Answers

This is not necessarily expected behavior. You might be making the same mistake I made when I began using Avro.

Avro supports reading with a different schema version than the one used for writing (e.g., write with one, read into another), but one thing that is very easy to miss (I did) is that you must still have the exact schema that wrote the message when you read it.

The documentation and articles about Avro, at least at the surface level, don't make that very clear. They usually focus on it being "backwards compatible." To be fair, it is in a sense, but the phrase suggests something slightly different: that you can read old messages using only a new schema. What it actually means is that you can read old messages using a new schema together with the old messages' schema.

As an example, see this pseudocode

Schema myUnsortedSchema has C B A order
Schema myAlphabeticalSchema has A B C order

Writer writer uses myUnsortedSchema
Reader badReader uses myAlphabeticalSchema only

writer writes message
badReader reads message

Error! I'm not sure exactly what the error message will say, but the problem is that badReader not only tries to read into myAlphabeticalSchema, it also decodes the message as if it had been written with myAlphabeticalSchema. The solution is to give the reader both schemas: the one that wrote the message and the one to read into (how you do that depends on the language).

Reader goodReader reads messages written with myUnsortedSchema into myAlphabeticalSchema

goodReader reads message

No error! This is the correct usage.

If you are using an approach like goodReader then this behavior is unexpected, but if you are using an approach like badReader then the behavior is expected.
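
To make the badReader/goodReader difference concrete, here is a minimal sketch in plain Java that does not use the Avro library at all. It assumes a toy record made only of string fields and mimics how Avro's binary encoding lays out a record: field values written back to back in the writer schema's order, each string as a zigzag varint length followed by UTF-8 bytes, with no field names on the wire. The names FieldOrderDemo, badRead and goodRead are made up for illustration. In real Avro Java code, the goodReader equivalent is the two-schema constructor GenericDatumReader(Schema writer, Schema reader).

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of Avro's binary layout for a record of string fields.
// This is NOT the Avro library; all names here are hypothetical.
class FieldOrderDemo {

    // Write values in the writer schema's field order; no names go on the wire.
    static byte[] write(List<String> writerOrder, Map<String, String> values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String field : writerOrder) {
            byte[] utf8 = values.get(field).getBytes(StandardCharsets.UTF_8);
            writeZigZagVarint(out, utf8.length);
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    // Decode strings sequentially, labelling them with the given field order.
    static Map<String, String> decodeInOrder(byte[] data, List<String> wireOrder) {
        ByteBuffer in = ByteBuffer.wrap(data);
        Map<String, String> byName = new LinkedHashMap<>();
        for (String field : wireOrder) {
            byte[] utf8 = new byte[readZigZagVarint(in)];
            in.get(utf8);
            byName.put(field, new String(utf8, StandardCharsets.UTF_8));
        }
        return byName;
    }

    // badReader: assumes the bytes were written in its own field order.
    static Map<String, String> badRead(byte[] data, List<String> readerOrder) {
        return decodeInOrder(data, readerOrder);
    }

    // goodReader: decodes in writer order, then matches fields by name.
    static Map<String, String> goodRead(byte[] data, List<String> writerOrder,
                                        List<String> readerOrder) {
        Map<String, String> written = decodeInOrder(data, writerOrder);
        Map<String, String> result = new LinkedHashMap<>();
        for (String field : readerOrder) {
            result.put(field, written.get(field));
        }
        return result;
    }

    static void writeZigZagVarint(ByteArrayOutputStream out, int n) {
        int v = (n << 1) ^ (n >> 31); // zigzag step
        while ((v & ~0x7F) != 0) {    // then 7 bits per byte, low bits first
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    static int readZigZagVarint(ByteBuffer in) {
        int v = 0;
        int shift = 0;
        int b;
        do {
            b = in.get() & 0xFF;
            v |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (v >>> 1) ^ -(v & 1);  // undo zigzag
    }

    public static void main(String[] args) {
        List<String> writerOrder = List.of("A", "B", "D", "C");
        List<String> readerOrder = List.of("A", "B", "C", "D");
        byte[] data = write(writerOrder,
                Map.of("A", "a", "B", "b", "C", "c", "D", "d"));
        System.out.println(badRead(data, readerOrder));
        System.out.println(goodRead(data, writerOrder, readerOrder));
    }
}
```

Running main prints {A=a, B=b, C=d, D=c} for the bad reader and {A=a, B=b, C=c, D=d} for the good one. With only string fields the mismatch is silent (values land under the wrong names). When the writer schema has extra or differently typed fields, as in the question, the reader misparses the bytes and can fail with decode errors such as "Invalid int encoding".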


Some services like Schema Registry help with this by prepending some metadata to the message bytes that identifies which schema wrote the message (and stripping it off before decoding, of course). It's outside the scope of the question but can help solve problems like this.
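
As a rough sketch of that framing idea, the code below assumes the layout Confluent documents for its Schema Registry serializers: one zero "magic" byte, then a 4-byte big-endian schema ID, then the Avro bytes. Other registries may frame messages differently. FramedMessage and its methods are hypothetical names, not part of any library, and the payload is treated as opaque bytes.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper; the layout is an assumption based on Confluent's wire format.
class FramedMessage {
    static final byte MAGIC = 0x0;
    static final int HEADER_LENGTH = 1 + 4; // magic byte + schema ID

    // Put the writer schema's ID in front of the encoded payload.
    static byte[] frame(int schemaId, byte[] avroPayload) {
        return ByteBuffer.allocate(HEADER_LENGTH + avroPayload.length)
                .put(MAGIC)
                .putInt(schemaId) // ByteBuffer is big-endian by default
                .put(avroPayload)
                .array();
    }

    // Read the schema ID so the reader can look up the writer schema.
    static int schemaIdOf(byte[] framed) {
        ByteBuffer buf = ByteBuffer.wrap(framed);
        if (buf.get() != MAGIC) {
            throw new IllegalArgumentException("unknown magic byte");
        }
        return buf.getInt();
    }

    // Strip the header, leaving only the bytes to hand to the decoder.
    static byte[] payloadOf(byte[] framed) {
        return Arrays.copyOfRange(framed, HEADER_LENGTH, framed.length);
    }
}
```

A consumer would call schemaIdOf to fetch the writer schema from the registry, then decode payloadOf with both that schema and its own reader schema.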

Captain Man answered Sep 22 '22 17:09


From the Avro spec (https://avro.apache.org/docs/1.8.1/spec.html): "the ordering of fields may be different: fields are matched by name." Note that your first schema also contains other fields that you haven't shown.

kushagra deep answered Sep 23 '22 17:09