Is there any tool able to create an Avro schema from a 'typical' JSON document?
For example:
{
  "records": [
    {"name": "X1", "age": 2},
    {"name": "X2", "age": 4}
  ]
}
I found http://jsonschema.net/reboot/#/, which generates a JSON Schema:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "id": "http://jsonschema.net#",
  "type": "object",
  "required": false,
  "properties": {
    "records": {
      "id": "#records",
      "type": "array",
      "required": false,
      "items": {
        "id": "#1",
        "type": "object",
        "required": false,
        "properties": {
          "name": {
            "id": "#name",
            "type": "string",
            "required": false
          },
          "age": {
            "id": "#age",
            "type": "integer",
            "required": false
          }
        }
      }
    }
  }
}
but I'd like an AVRO version.
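For reference, the kind of Avro schema I'm after would look something like the sketch below; the record names `TopLevel` and `Person` are just placeholders, since any generator would have to invent names itself:

```json
{
  "type": "record",
  "name": "TopLevel",
  "fields": [
    {"name": "records", "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "Person",
        "fields": [
          {"name": "name", "type": "string"},
          {"name": "age", "type": "int"}
        ]
      }
    }}
  ]
}
```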
An Avro schema is itself written in JSON. JSON (JavaScript Object Notation) is a lightweight, text-based data interchange format intended to be easy for humans to read and write, and it is documented in a great many places, both on the web and in print.
For those using the Apache Avro C# library, the utility function DataFileReader<GenericRecord>.OpenReader(filename) can be used to instantiate a dataFileReader. Once instantiated, the dataFileReader is used just as in Java.
Apache Avro ships with some very capable and efficient tools for reading and writing binary Avro, but its support for JSON-to-Avro conversion is unfortunately limited: if your schema contains optional (union-typed) fields, the input JSON must wrap their values with explicit type declarations.
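For example, assuming a record schema with a field `comment` of union type `["null", "string"]` (a hypothetical field name), the plain-JSON value `{"comment": "hi"}` has to be written in Avro's JSON encoding with the branch type spelled out:

```json
{"comment": {"string": "hi"}}
```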
You can achieve that easily using Apache Spark and Python. First download a Spark distribution from http://spark.apache.org/downloads.html and install the avro package for Python using pip. Then run pyspark with the spark-avro package:
./bin/pyspark --packages com.databricks:spark-avro_2.11:3.1.0
and use the following code (assuming input.json contains one or more JSON documents, each on a separate line):
import os
import avro.datafile, avro.io

spark.read.json('input.json').coalesce(1).write.format("com.databricks.spark.avro").save("output.avro")

# Spark writes the data as a part file inside the output.avro directory
avrofile = [f for f in os.listdir('output.avro') if f.startswith('part-r-00000')][0]
with open('output.avro/' + avrofile, 'rb') as f:
    reader = avro.datafile.DataFileReader(f, avro.io.DatumReader())
    print(reader.datum_reader.writers_schema)
For example, for an input file with the content:
{"string": "somestring", "number": 3.14, "structure": {"integer": 13}}
{"string": "somestring2", "structure": {"integer": 14}}
the script will print:
{"fields": [{"type": ["double", "null"], "name": "number"}, {"type": ["string", "null"], "name": "string"}, {"type": [{"type": "record", "namespace": "", "name": "structure", "fields": [{"type": ["long", "null"], "name": "integer"}]}, "null"], "name": "structure"}], "type": "record", "name": "topLevelRecord"}
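The schema comes out as a single line; for readability you can pretty-print it with the standard-library json module (the string below is the schema from the example output above):

```python
import json

# Schema string as printed by the Spark/Avro script above
schema_str = ('{"fields": [{"type": ["double", "null"], "name": "number"}, '
              '{"type": ["string", "null"], "name": "string"}, '
              '{"type": [{"type": "record", "namespace": "", "name": "structure", '
              '"fields": [{"type": ["long", "null"], "name": "integer"}]}, "null"], '
              '"name": "structure"}], "type": "record", "name": "topLevelRecord"}')

schema = json.loads(schema_str)
print(json.dumps(schema, indent=2))
```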