Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Recursive data type like a tree as Avro schema

Reading https://avro.apache.org/docs/current/spec.html it says a schema must be one of:

  • A JSON string, naming a defined type.
  • A JSON object, of the form: {"type": "typeName" ...attributes...} where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.
  • A JSON array, representing a union of embedded types.

I want a schema that describes a tree, using the recursive definition that a tree is either:

  • A node with a value (say, integer) and a list of trees (the children)
  • A leaf with a value

My initial attempt looked like:

{
  "name": "Tree",
  "type": [
    {
      "name": "Node",
      "type": "record",
      "fields": [
        {
          "name": "value",
          "type": "long"
        },
        {
          "name": "children",
          "type": { "type": "array", "items": "Tree" }
        }
      ]
    },
    {
      "name": "Leaf",
      "type": "record",
      "fields": [
        {
          "name": "value",
          "type": "long"
        }
      ]
    }
  ]
}

But the Avro compiler rejects this, complaining there is nothing of type {"name":"Tree","type":[{"name":"Node".... It seems Avro doesn't like the union type at the top-level. I'm guessing this falls under the aforementioned rule "a schema must be one of .. a JSON object .. where typeName is either a primitive or derived type name." I am not sure what a "derived type name" is though. At first I thought it was the same as a "complex type" but that includes union types..

Anyways, changing it to the more convoluted definition:

{
  "name": "Tree",
  "type": "record",
  "fields": [{
    "name": "ctors",
    "type": [
      {
        "name": "Node",
        "type": "record",
        "fields": [
          {
            "name": "value",
            "type": "long"
          },
          {
            "name": "children",
            "type": { "type": "array", "items": "Tree" }
          }
        ]
      },
      {
        "name": "Leaf",
        "type": "record",
        "fields": [
          {
            "name": "value",
            "type": "long"
          }
        ]
      }
    ]
  }]
}

works, but now I have this weird record with just a single field whose sole purpose is to let me define the top-level union type I want.

Is this the only way to get what I want in Avro or is there a better way?

Thanks!

like image 534
adelbertc Avatar asked Oct 19 '17 21:10

adelbertc


1 Answers

If you represent a Tree as a node, and a Leaf as a node with an empty list of children, you can avoid the named union problem completely, and do this quite simply with one recursive type:

{
  "type": "record",
  "name": "TreeNode",
  "fields": [
    {
      "name": "value",
      "type": "long"
    },
    {
      "name": "children",
      "type": { "type": "array", "items": "TreeNode" }
    }
  ]
}

Now, your three types Tree, Node, and Leaf are unified into one type TreeNode, and there is no union of Node and Leaf necessary.

like image 193
Raman Avatar answered Oct 04 '22 21:10

Raman