I'm trying to validate a JSON file using an Avro schema and write the corresponding Avro file. First, I've defined the following Avro schema, named user.avsc:
{"namespace": "example.avro", "type": "record", "name": "user", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }
Then I created a user.json file:
{"name": "Alyssa", "favorite_number": 256, "favorite_color": null}
And then tried to run:
java -jar ~/bin/avro-tools-1.7.7.jar fromjson --schema-file user.avsc user.json > user.avro
But I get the following exception:
Exception in thread "main" org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
    at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
    at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
    at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:99)
    at org.apache.avro.tool.Main.run(Main.java:84)
    at org.apache.avro.tool.Main.main(Main.java:73)
Am I missing something? Why do I get "Expected start-union. Got VALUE_NUMBER_INT"?
Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format.
Apache Avro ships with very advanced and efficient tools for reading and writing binary Avro, but its support for JSON-to-Avro conversion is unfortunately limited: if your schema contains optional (union-typed) fields, you must wrap their values in type declarations.
According to the explanation by Doug Cutting,
Avro's JSON encoding requires that non-null union values be tagged with their intended type. This is because unions like ["bytes","string"] and ["int","long"] are ambiguous in JSON: the first pair are both encoded as JSON strings, while the second pair are both encoded as JSON numbers.
http://avro.apache.org/docs/current/spec.html#json_encoding
Thus your record must be encoded as:
{"name": "Alyssa", "favorite_number": {"int": 7}, "favorite_color": null}
There is a new JSON encoder in the works that should address this common issue:
https://issues.apache.org/jira/browse/AVRO-1582
https://github.com/zolyfarkas/avro