
How do you serialize a union field in Avro using Python when the attributes match?

Say you have this AVDL as a simplified example:

@namespace("example.avro")
protocol User {
   record Man {
      int age;
   }

   record Woman {
      int age;
   }

   record User {
      union {
        Man,
        Woman
      } user_info;
   }
}

In Python, you cannot serialize an object while stating its type, because this syntax is not allowed:

{"user_info": {"Woman": {"age": 18}}}

and the only object that gets serialized is

{"user_info": {"age": 18}}

This loses all the type information: the DatumWriter usually picks the first record that matches the set of fields, in this case Man.
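To illustrate why this matters on the wire, here is a minimal stdlib-only sketch. Per the Avro specification, a union value is written as the zero-based index of the chosen branch (a zigzag varint), followed by the value itself, so a Man and a Woman with the same age differ only in that index. The helper names (`zigzag_varint`, `first_matching_branch`, `encode_user`) are hypothetical, and the first-match rule is a simplified model of the behaviour described above, not the avro library's actual code. It encodes a single raw datum, not a full Avro container file.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an integer the way Avro encodes int/long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


# Union branches in schema order; both records have the same single field.
BRANCHES = {"Man": ("age",), "Woman": ("age",)}


def first_matching_branch(datum: dict) -> int:
    """Simplified model: pick the first branch whose field names match the datum."""
    for index, fields in enumerate(BRANCHES.values()):
        if set(datum) == set(fields):
            return index
    raise ValueError("datum matches no branch of the union")


def encode_user(datum: dict, branch: str = None) -> bytes:
    """Encode a User record: union branch index, then the record's only field."""
    if branch is None:
        index = first_matching_branch(datum)   # type information is guessed
    else:
        index = list(BRANCHES).index(branch)   # type information is explicit
    return zigzag_varint(index) + zigzag_varint(datum["age"])


guessed = encode_user({"age": 18})                    # resolves to Man (index 0)
explicit = encode_user({"age": 18}, branch="Woman")   # index 1
print(guessed.hex(), explicit.hex())
```

The two encodings differ only in the first byte, which is why a writer that cannot be told the branch silently produces a Man.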

The same case works perfectly well when using the Java API.

So, what am I doing wrong here? Is it possible that serialization followed by deserialization does not round-trip in Python's Avro implementation?

asked Jan 25 '18 12:01 by tonicebrian




1 Answer

You are correct that the standard avro library has no way to specify which schema to use in cases like this. However, fastavro (an alternative implementation) does. In fastavro, a union value can be written as a tuple whose first element is the schema name and whose second element is the actual record data. The record would look like this:

{"user_info": ("Woman", {"age": 18})}

Here's an example script:

from io import BytesIO
from fastavro import writer

schema = {
    "type": "record",
    "name": "User",
    "fields": [{
        "name": "user_info",
        "type": [
            {
                "type": "record",
                "name": "Man",
                "fields": [{
                    "name": "age",
                    "type": "int"
                }]
            },
            {
                "type": "record",
                "name": "Woman",
                "fields": [{
                    "name": "age",
                    "type": "int"
                }]
            }
        ]
    }]
}

records = [{"user_info": ("Woman", {"age": 18})}]

bio = BytesIO()
writer(bio, schema, records)
answered Sep 18 '22 00:09 by Scott
