Nesting Avro schemas

According to this question on nesting Avro schemas, the right way to nest a record schema is as follows:

{
    "name": "person",
    "type": "record",
    "fields": [
        {"name": "firstname", "type": "string"},
        {"name": "lastname", "type": "string"},
        {
            "name": "address",
            "type": {
                "type": "record",
                "name": "AddressUSRecord",
                "fields": [
                    {"name": "streetaddress", "type": "string"},
                    {"name": "city", "type": "string"}
                ]
            }
        }
    ]
}

I don't like giving the field the name address and having to give a different name (AddressUSRecord) to the field's schema. Can I give the field and schema the same name, address?

What if I want to use the AddressUSRecord schema in multiple other schemas, not just person? If I want to use AddressUSRecord in another schema, let's say business, do I have to name it something else?

Ideally, I'd like to define AddressUSRecord in a separate schema, then let the type of address reference AddressUSRecord. However, it's not clear that Avro 1.8.1 supports this out of the box. This 2014 article shows that sub-schemas need to be handled with custom code. What is the best way to define reusable schemas in Avro 1.8.1?

Note: I'd like a solution that works with Confluent Inc.'s Schema Registry. There's a Google Groups thread that seems to suggest that Schema Registry does not play nice with schema references.

Tianxiang Xiong asked Nov 28 '16 22:11


1 Answer

Can I give the field and schema the same name, address?

Yes, you can name the record with the same name as the field name.

What if I want to use the AddressUSRecord schema in multiple other schemas, not just person?

You can share a schema across several others in a couple of ways. Avro's schema parsers (on the JVM and elsewhere) let you parse several schemas against a shared set of known names, usually through a names parameter. In Java, a single Schema.Parser instance keeps the named types from earlier parse calls, so a schema parsed later can refer to them.

You can then define the dependent schema as a named type:

{
  "type": "record",
  "name": "Address",
  "fields": [
    {
      "name": "streetaddress",
      "type": "string"
    },
    {
      "name": "city",
      "type": "string"
    }
  ]
}

Run it through the parser before the parent schema, which refers to it by name:

{
  "name": "person",
  "type": "record",
  "fields": [
    {
      "name": "firstname",
      "type": "string"
    },
    {
      "name": "lastname",
      "type": "string"
    },
    {
      "name": "address",
      "type": "Address"
    }
  ]
}

Incidentally, this allows you to parse from separate files.
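
To make the ordering requirement concrete, here is a minimal Python sketch of the bookkeeping a parser does with named types. It uses only the standard library and is not the Avro library's API: `register`, `PRIMITIVES`, and the `names` dict are hypothetical names chosen for illustration, and namespaces are ignored for brevity.

```python
import json

# Avro's primitive type names; any other string must refer to a named type.
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}

def register(schema, names):
    """Record named types found in `schema`; fail on references to unknown names."""
    if isinstance(schema, str):
        if schema not in PRIMITIVES and schema not in names:
            raise KeyError(f"undefined type: {schema}")
    elif isinstance(schema, list):  # a union: check every branch
        for branch in schema:
            register(branch, names)
    else:
        kind = schema["type"]
        if kind in ("record", "enum", "fixed"):
            # Register the name before visiting fields so self-references resolve.
            names[schema["name"]] = schema
            for field in schema.get("fields", []):
                register(field["type"], names)
        elif kind == "array":
            register(schema["items"], names)
        elif kind == "map":
            register(schema["values"], names)
        else:
            register(kind, names)
    return names

ADDRESS = json.loads("""
{"type": "record", "name": "Address", "fields": [
    {"name": "streetaddress", "type": "string"},
    {"name": "city", "type": "string"}]}
""")

PERSON = json.loads("""
{"type": "record", "name": "person", "fields": [
    {"name": "firstname", "type": "string"},
    {"name": "lastname", "type": "string"},
    {"name": "address", "type": "Address"}]}
""")

names = {}
register(ADDRESS, names)  # dependency first
register(PERSON, names)   # "Address" is now a known name
print(sorted(names))
```

Registering `PERSON` against an empty `names` dict raises `KeyError`, which mirrors why the dependent schema has to go through the parser first.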

Alternatively, you can parse a single union schema that contains both definitions and references them in the same way:

[
  {
    "type": "record",
    "name": "Address",
    "fields": [
      {
        "name": "streetaddress",
        "type": "string"
      },
      {
        "name": "city",
        "type": "string"
      }
    ]
  },
  {
    "type": "record",
    "name": "person",
    "fields": [
      {
        "name": "firstname",
        "type": "string"
      },
      {
        "name": "lastname",
        "type": "string"
      },
      {
        "name": "address",
        "type": "Address"
      }
    ]
  }
]
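
If the schemas are maintained separately, the union above can be assembled rather than written by hand. The following is a rough sketch under that assumption: `build_union` is a hypothetical helper, the schema texts are inlined here but would typically be read from separate files, and the caller is responsible for listing dependencies before the schemas that reference them.

```python
import json

ADDRESS_AVSC = """
{"type": "record", "name": "Address", "fields": [
    {"name": "streetaddress", "type": "string"},
    {"name": "city", "type": "string"}]}
"""

PERSON_AVSC = """
{"type": "record", "name": "person", "fields": [
    {"name": "firstname", "type": "string"},
    {"name": "lastname", "type": "string"},
    {"name": "address", "type": "Address"}]}
"""

def build_union(*schema_texts):
    """Combine separately written record schemas into one top-level union (a JSON array)."""
    schemas = [json.loads(text) for text in schema_texts]
    seen = set()
    for schema in schemas:
        # Named types inside a union must have distinct names.
        if schema["name"] in seen:
            raise ValueError(f"duplicate type name in union: {schema['name']}")
        seen.add(schema["name"])
    return json.dumps(schemas, indent=2)

# Dependencies go first so later entries can refer to them by name.
union_text = build_union(ADDRESS_AVSC, PERSON_AVSC)
print(union_text)
```

The resulting text is a single schema, so it can be handed to a parser in one call.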

I'd like a solution that works with Confluent Inc.'s Schema Registry.

The Schema Registry does not support parsing schemas separately, but it does support the second approach, where everything is parsed as a single union schema.

Niel Drummond answered Sep 19 '22 03:09