 

Generating an Avro schema from a JSON document


Is there any tool able to create an Avro schema from a 'typical' JSON document?

For example:

{
  "records": [{"name": "X1", "age": 2}, {"name": "X2", "age": 4}]
}

I found http://jsonschema.net/reboot/#/, which generates a JSON Schema:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "id": "http://jsonschema.net#",
  "type": "object",
  "required": false,
  "properties": {
    "records": {
      "id": "#records",
      "type": "array",
      "required": false,
      "items": {
        "id": "#1",
        "type": "object",
        "required": false,
        "properties": {
          "name": {
            "id": "#name",
            "type": "string",
            "required": false
          },
          "age": {
            "id": "#age",
            "type": "integer",
            "required": false
          }
        }
      }
    }
  }
}

but I'd like an Avro version.
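
For reference, the kind of Avro schema I'm after would look something like this. The record names below ("TopLevel", "Person") are just placeholders I made up, since Avro requires every record to be named:

{
  "type": "record",
  "name": "TopLevel",
  "fields": [
    {
      "name": "records",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "Person",
          "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"}
          ]
        }
      }
    }
  ]
}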

asked Jul 03 '14 by Pierre



1 Answer

You can achieve this easily using Apache Spark and Python. First, download a Spark distribution from http://spark.apache.org/downloads.html and install the avro package for Python using pip. Then run pyspark with the spark-avro package:

./bin/pyspark --packages com.databricks:spark-avro_2.11:3.1.0
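
(On Spark 2.4 and later, Avro support ships with Spark itself as the built-in spark-avro module, so the equivalent would be along these lines, with the artifact version matched to your Spark and Scala build; 2.12/3.3.0 here is only an example:

./bin/pyspark --packages org.apache.spark:spark-avro_2.12:3.3.0

together with .write.format("avro") in place of "com.databricks.spark.avro".)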

and use the following code (assuming input.json contains one or more JSON documents, each on a separate line):

import os
import avro.datafile, avro.io

# Read the JSON lines, merge them into a single partition and write one Avro file
spark.read.json('input.json').coalesce(1).write.format("com.databricks.spark.avro").save("output.avro")

# Spark writes a directory of files; pick out the single data part file
avrofilename = [f for f in os.listdir('output.avro') if f.startswith('part-')][0]

# Open the Avro container in binary mode and print the schema the writer embedded
# (on avro >= 1.9 this attribute is named writer_schema instead of writers_schema)
with open('output.avro/' + avrofilename, 'rb') as avrofile:
    reader = avro.datafile.DataFileReader(avrofile, avro.io.DatumReader())
    print(reader.datum_reader.writers_schema)
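
If you want the schema indented rather than printed on one line, you can round-trip it through Python's json module. This is a small convenience on top of the script above (run it inside the with block, while reader is still in scope) and relies on str() of an avro Schema object yielding its JSON text:

import json

# Parse the schema's JSON text and re-serialize it with indentation
schema_json = json.loads(str(reader.datum_reader.writers_schema))
print(json.dumps(schema_json, indent=2))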

For example, for an input file with this content:

{"string": "somestring", "number": 3.14, "structure": {"integer": 13}}
{"string": "somestring2", "structure": {"integer": 14}}

the script prints the following schema (shown pretty-printed here for readability; the actual output is a single line):

{"fields": [{"type": ["double", "null"], "name": "number"}, {"type": ["string", "null"], "name": "string"}, {"type": [{"type": "record", "namespace": "", "name": "structure", "fields": [{"type": ["long", "null"], "name": "integer"}]}, "null"], "name": "structure"}], "type": "record", "name": "topLevelRecord"}
answered Sep 23 '22 by Mariusz