I am new to Hadoop and programming, and I am a little confused about Avro schema evolution. Here is what I understand about Avro so far.
Avro is a serialization tool that stores binary data with its JSON schema at the top of the file. The schema looks like this:
{
  "namespace": "com.trese.db.model",
  "type": "record",
  "doc": "This schema describes a Product",
  "name": "Product",
  "fields": [
    {"name": "product_id", "type": "long"},
    {"name": "product_name", "type": "string", "doc": "This is the name of the product"},
    {"name": "cost", "type": "float", "aliases": ["price"]},
    {"name": "discount", "type": "float", "default": 5.0}
  ]
}
Now my question is: why do we need schema evolution? I have read that we can use a default value in the schema for new fields, but if we add a new schema to the file, the earlier schema will be overwritten; we cannot have two schemas for a single file.
Another question: what are reader and writer schemas, and how do they help?
A common trait shared by large-scale data platforms is that they use Apache Avro to provide strong schema-on-write data contracts. Importantly, Avro also offers the ability to safely and confidently evolve data model definitions. After all, we should expect the shape of data to change over time.
Avro schemas also allow serialized values to be stored in a very space-efficient binary format. In a store that tracks schemas, each value needs no metadata beyond a small internal schema identifier, between 1 and 4 bytes in size, with one such reference stored per key-value pair.
Schema evolution is the term for how a store behaves when an Avro schema is changed after data has already been written using an older version of that schema.
In the Kafka world, Apache Avro is by far the most widely used serialization format. Combined with Kafka, it provides schema-based, robust, and fast binary serialization.
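To make this concrete, here is a minimal sketch of writing Avro data with the question's Product schema, trimmed to its first three fields (no discount yet, so we can evolve it below). It assumes the third-party fastavro library and a hypothetical products.avro file; the standard avro package works similarly.

# Writing records with the original (writer) schema.
from fastavro import writer, parse_schema

old_schema = parse_schema({
    "namespace": "com.trese.db.model",
    "type": "record",
    "name": "Product",
    "fields": [
        {"name": "product_id", "type": "long"},
        {"name": "product_name", "type": "string"},
        {"name": "cost", "type": "float"},
    ],
})

records = [
    {"product_id": 1, "product_name": "widget", "cost": 9.99},
    {"product_id": 2, "product_name": "gadget", "cost": 19.99},
]

# The writer schema is embedded once in the file header; each record
# is then stored as compact binary, without repeating field names.
with open("products.avro", "wb") as out:
    writer(out, old_schema, records)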
If you have one Avro file and you want to change its schema, you can rewrite that file with the new schema inside. But what if you have terabytes of Avro files and you want to change their schema? Will you rewrite all of that data every time the schema changes?
Schema evolution allows you to update the schema used to write new data while maintaining backward compatibility with the schema(s) of your old data. Then you can read it all together, as if all of the data had one schema. Of course there are precise rules governing the changes allowed in order to maintain compatibility; those rules are listed in the Avro specification under Schema Resolution.
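Continuing the sketch above (still fastavro, still the hypothetical products.avro), here is that resolution in action: the file was written with the old schema, and a newer reader schema adds the discount field with a default, so old records can be read as if they always had it.

# Reading old data through a newer (reader) schema.
from fastavro import reader, parse_schema

new_schema = parse_schema({
    "namespace": "com.trese.db.model",
    "type": "record",
    "name": "Product",
    "fields": [
        {"name": "product_id", "type": "long"},
        {"name": "product_name", "type": "string"},
        {"name": "cost", "type": "float"},
        # New field, with a default for records written before it existed.
        {"name": "discount", "type": "float", "default": 5.0},
    ],
})

# The file still carries the old writer schema in its header; passing
# new_schema as the reader schema makes fastavro apply Avro's schema
# resolution rules, filling in the default for the missing field.
with open("products.avro", "rb") as fo:
    for record in reader(fo, reader_schema=new_schema):
        print(record)  # each record now includes "discount": 5.0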
There are other use cases for reader and writer schemas beyond evolution. You can use a reader schema as a filter: imagine data with hundreds of fields, of which you are only interested in a handful. You can create a schema for that handful of fields to read only the data you need. You can also go the other way and create a reader schema which adds default data, or use a reader schema to merge the schemas of two different datasets.
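For example, a reader schema that keeps only the fields you care about acts as a projection over a wide record (same fastavro assumption as above):

# Using a reader schema as a projection/filter.
from fastavro import reader, parse_schema

projection = parse_schema({
    "namespace": "com.trese.db.model",
    "type": "record",
    "name": "Product",
    "fields": [
        {"name": "product_id", "type": "long"},
        {"name": "cost", "type": "float"},
    ],
})

with open("products.avro", "rb") as fo:
    for record in reader(fo, reader_schema=projection):
        print(record)  # only product_id and cost are returned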
Or you can just use one schema, which never changes, for both reading and writing. That's the simplest case.