Why do we need Avro schema evolution?

Tags: hadoop, avro

I am new to Hadoop and programming, and I am a little confused about Avro schema evolution. I will explain what I understand about Avro so far.

Avro is a serialization tool that stores binary data with its JSON schema at the top. The schema looks like this:

{
    "namespace": "com.trese.db.model",
    "type": "record",
    "doc": "This schema describes a Product",
    "name": "Product",
    "fields": [
        {"name": "product_id", "type": "long"},
        {"name": "product_name", "type": "string", "doc": "This is the name of the product"},
        {"name": "cost", "type": "float", "aliases": ["price"]},
        {"name": "discount", "type": "float", "default": 5.0}
    ]
}

Now my question is: why do we need evolution? I have read that we can use a default in the schema for new fields, but if we add a new schema to the file, the earlier schema will be overwritten. We cannot have two schemas for a single file.

Another question: what are reader and writer schemas, and how do they help?

Anaadih.pradeep asked Aug 25 '16


People also ask

Why is Avro better for schema evolution?

A common trait shared by such data platforms is that they use Apache Avro to provide strong schema-on-write data contracts. Importantly, Avro also lets customers safely and confidently evolve their data model definitions. After all, we should expect the shape of data to change over time.

Why is Avro schema needed?

The use of Avro schemas allows serialized values to be stored in a very space-efficient binary format. Each value is stored without any metadata other than a small internal schema identifier, between 1 and 4 bytes in size. One such reference is stored per key-value pair.

What is schema evolution in Avro?

Schema evolution is the term for how the store behaves when an Avro schema is changed after data has been written to the store using an older version of that schema.

What is the use of Avro schema in Kafka?

In the Kafka world, Apache Avro is by far the most used serialization protocol. Avro is a data serialization system. Combined with Kafka, it provides schema-based, robust, and fast binary serialization.


1 Answer

If you have one Avro file and you want to change its schema, you can rewrite that file with a new schema inside. But what if you have terabytes of Avro files and you want to change their schema? Will you rewrite all of the data, every time the schema changes?

Schema evolution allows you to update the schema used to write new data while maintaining backward compatibility with the schema(s) of your old data. Then you can read it all together, as if all of the data had one schema. Of course there are precise rules governing the changes allowed, to maintain compatibility; those rules are listed under "Schema Resolution" in the Avro specification.
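
To make that concrete, here is a minimal sketch using Avro's Java GenericRecord API. The schema strings are trimmed-down versions of the Product schema in the question, and the file name is made up for the example. A record is written with an "old" schema that lacks discount, then read back with a "new" schema that declares discount with a default; Avro's schema resolution fills the missing field in.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class EvolutionDemo {
    public static void main(String[] args) throws Exception {
        // Writer schema: the "old" Product, with no discount field.
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Product\",\"fields\":["
          + "{\"name\":\"product_id\",\"type\":\"long\"},"
          + "{\"name\":\"product_name\",\"type\":\"string\"}]}");

        // Reader schema: the "new" Product, which adds discount with a default.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Product\",\"fields\":["
          + "{\"name\":\"product_id\",\"type\":\"long\"},"
          + "{\"name\":\"product_name\",\"type\":\"string\"},"
          + "{\"name\":\"discount\",\"type\":\"float\",\"default\":5.0}]}");

        File file = new File("products.avro");

        // Write one record using the old schema.
        GenericRecord product = new GenericData.Record(writerSchema);
        product.put("product_id", 1L);
        product.put("product_name", "widget");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<>(writerSchema))) {
            writer.create(writerSchema, file);
            writer.append(product);
        }

        // Read it back with the new schema. The writer schema is taken from
        // the file header; the reader schema supplies the default for discount.
        GenericDatumReader<GenericRecord> datumReader =
            new GenericDatumReader<>(null, readerSchema);
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, datumReader)) {
            for (GenericRecord record : reader) {
                System.out.println(record); // discount resolves to 5.0
            }
        }
    }
}

The old file is never rewritten; only the reader schema changes. The same mechanism works in the other direction (old reader, new writer) as long as the change follows the Schema Resolution rules.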

There are other use cases for reader and writer schemas, beyond evolution. You can use a reader schema as a filter. Imagine data with hundreds of fields, of which you are interested in only a handful. You can create a schema for that handful of fields to read only the data you need. You can go the other way and create a reader schema that adds default data, or use a schema to join the schemas of two different datasets.
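
For the filter case, only a reader schema is needed; the writer schema is always recovered from the file header. A hypothetical projection over a file written with the full four-field Product schema from the question might look like this (a fragment reusing the imports from the sketch above; the file name is again made up):

// Keep only two of the four Product fields; the rest are skipped on read.
Schema projection = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"Product\",\"fields\":["
  + "{\"name\":\"product_id\",\"type\":\"long\"},"
  + "{\"name\":\"cost\",\"type\":\"float\"}]}");

try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
         new File("products.avro"),
         new GenericDatumReader<GenericRecord>(null, projection))) {
    for (GenericRecord record : reader) {
        System.out.println(record.get("product_id") + " " + record.get("cost"));
    }
}

Note that the record name in the projection ("Product") must match the writer's record name for resolution to succeed.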

Or you can just use one schema, which never changes, for both reading and writing. That's the simplest case.

jaco0646 answered Sep 28 '22