
Use of "default" in Avro schema

As per the definition of "default" attribute in Avro docs: "A default value for this field, used when reading instances that lack this field (optional)."

This means that if the corresponding field is missing, the default value is taken.

But this does not seem to be the case. Consider the following student schema:

{
        "type": "record",
        "namespace": "com.example",
        "name": "Student",
        "fields": [{
                "name": "age",
                "type": "int",
                "default": -1
            },
            {
                "name": "name",
                "type": "string",
                "default": "null"
            }
        ]
    }

The schema says that if the "age" field is missing, its value should be taken as -1. Likewise for the "name" field.

Now, if I try to construct a Student model from the following JSON:

{"age":70}

I get this exception:

org.apache.avro.AvroTypeException: Expected string. Got END_OBJECT

    at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:698)
    at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:227)

It looks like the default is NOT working as expected. So what exactly is the role of "default" here?

This is the code used to generate Student model:

Decoder decoder = DecoderFactory.get().jsonDecoder(Student.SCHEMA$, studentJson);
SpecificDatumReader<Student> datumReader = new SpecificDatumReader<>(Student.class);
return datumReader.read(null, decoder);

(Student class is auto-generated by Avro compiler from student schema)

Pavan asked Feb 26 '18 09:02



1 Answer

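I think there is some misunderstanding around default values, so hopefully my explanation will help other people as well. The default value is used to fill in a field that is not present, but this happens during schema resolution, when the reader's schema (here, the generated Student class) has a field that the schema the data was written with does not. It does not let the decoder accept input that fails to match the schema it was told to expect. In your code the decoder is given the full Student schema, so it expects a "name" in the JSON and fails when it is missing. Reading data written with a different schema requires knowing that writer schema, which is why the concept of a "schema registry" is useful in this kind of situation.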

The following code works and lets you read your data:

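// Same decoder as in the question.
Decoder decoder = DecoderFactory.get().jsonDecoder(Student.SCHEMA$, "{\"age\":70}");
// Reader schema: the full Student schema, taken from the generated class.
SpecificDatumReader<Student> datumReader = new SpecificDatumReader<>(Student.class);

// Writer schema: describes the JSON input, which has no "name" field.
Schema writerSchema = new Schema.Parser().parse("{\n" +
        "  \"type\": \"record\",\n" +
        "  \"namespace\": \"com.example\",\n" +
        "  \"name\": \"Student\",\n" +
        "  \"fields\": [{\n" +
        "    \"name\": \"age\",\n" +
        "    \"type\": \"int\",\n" +
        "    \"default\": -1\n" +
        "  }\n" +
        "  ]\n" +
        "}");

// setSchema() sets the writer's schema; the reader schema stays Student's.
datumReader.setSchema(writerSchema);
System.out.println(datumReader.read(null, decoder));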

As you can see, I am specifying the schema used to "write" the JSON input, which does not contain the field "name". However, because your schema contains a default value for it, when you print the record you will see the name with your default value:

{"age": 70, "name": "null"}

Just in case you do not already know: "null" here is not really a null value; it is a string with the value "null".
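
If a real null default is what you want for "name", one common way to express it is a union type with "null" listed first and a JSON null (no quotes) as the default. The fragment below is only a sketch of that single field, not something tested against your generated class:

```json
{
    "name": "name",
    "type": ["null", "string"],
    "default": null
}
```

"null" goes first because the Avro specification has traditionally required a union field's default to match the first branch of the union. The Student class would need to be regenerated after changing the schema.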

hlagos answered Oct 12 '22 12:10