What is the difference and meaning of these two statements that I encountered during a lecture here: <pre class="prettyprint"><code>1. Traditional databases enforce schema during load time. </code></pre> and <pre class="prettyprint"><code>2. Hive enforces schema during read time. </code></pre>

You touch on one of the reasons why Hadoop and other NoSQL strategies have been so successful, so I'm not sure if you were expecting to get a dissertation or not, but here it is! The extra flexibility and agility in data analysis has probably contributed to the explosion of "data science", just because it makes large-scale data analysis easier in general. A traditional relational database stores the data with schema in mind. It knows that the second column is an integer, it knows that it has 40 columns, etc. Therefore, you need to specify your schema ahead of time and have it well planned out. This is "schema on write" -- that is, the schema is applied when the data is being written to the data store. Hive (in some cases), Hadoop, and many other NoSQL systems in general are about "schema on read" -- the schema is applied as the data is being read off of the data store. Consider the following line of raw text: <pre class="prettyprint"><code>A:B:C~E:F~G:H~~I::J~K~L </code></pre> There are a couple ways to interpret this. <code>~</code> could be the delimiter or maybe <code>:</code> could be the delimiter. Who knows? With schema on read, it doesn't matter. You decide what the schema is when you analyze the data, not when you write the data. This example is a bit ridiculous in that you probably won't ever encounter this case, but it gets the point across hopefully. With schema on read, you just load your data into the data store and think about how to parse and interpret later. At the core of this explanation, schema on read means write your data first, figure out what it is later. Schema on write means figure out what your data is first, then write it after. <hr> There is a tradeoff here. Some of these are subjective and my own opinion. Benefits of schema on write: <ul> <li>Better type safety and data cleansing done for the data at rest</li> <li>Typically more efficient (storage size and computationally) since the data is already parsed</li> </ul> Downsides of schema on write: <ul> <li>You have to plan ahead of time what your schema is before you store the data (i.e., you have to do ETL)</li> <li>Typically you throw away the original data, which could be bad if you have a bug in your ingest process</li> <li>It's harder to have different views of the same data</li> </ul> Benefits of schema on read: <ul> <li>Flexibility in defining how your data is interpreted at load time <ul> <li>This gives you the ability to evolve your "schema" as time goes on</li> <li>This allows you to have different versions of your "schema"</li> <li>This allows the original source data format to change without having to consolidate to one data format</li> </ul> </li> <li>You get to keep your original data</li> <li>You can load your data before you know what to do with it (so you don't drop it on the ground)</li> <li>Gives you flexibility in being able to store unstructured, unclean, and/or unorganized data</li> </ul> Downsides of schema on read: <ul> <li>Generally it is less efficient because you have to reparse and reinterpret the data every time (this can be expensive with formats like XML)</li> <li>The data is not self-documenting (i.e., you can't look at a schema to figure out what the data is)</li> <li>More error prone and your analytics have to account for dirty data</li> </ul>

Hive enforces schema during read time?

Tags:

hadoop

hive

mapreduce

hdfs

What is the difference and meaning of these two statements that I encountered during a lecture here:

1. Traditional databases enforce schema during load time.

and

2. Hive enforces schema during read time.

852

asked Aug 01 '12 17:08

London guy

1 Answers

You touch on one of the reasons why Hadoop and other NoSQL strategies have been so successful, so I'm not sure if you were expecting to get a dissertation or not, but here it is! The extra flexibility and agility in data analysis has probably contributed to the explosion of "data science", just because it makes large-scale data analysis easier in general.

A traditional relational database stores the data with schema in mind. It knows that the second column is an integer, it knows that it has 40 columns, etc. Therefore, you need to specify your schema ahead of time and have it well planned out. This is "schema on write" -- that is, the schema is applied when the data is being written to the data store.

Hive (in some cases), Hadoop, and many other NoSQL systems in general are about "schema on read" -- the schema is applied as the data is being read off of the data store. Consider the following line of raw text:

A:B:C~E:F~G:H~~I::J~K~L

There are a couple ways to interpret this. ~ could be the delimiter or maybe : could be the delimiter. Who knows? With schema on read, it doesn't matter. You decide what the schema is when you analyze the data, not when you write the data. This example is a bit ridiculous in that you probably won't ever encounter this case, but it gets the point across hopefully.

With schema on read, you just load your data into the data store and think about how to parse and interpret later. At the core of this explanation, schema on read means write your data first, figure out what it is later. Schema on write means figure out what your data is first, then write it after.

There is a tradeoff here. Some of these are subjective and my own opinion.

Benefits of schema on write:

Better type safety and data cleansing done for the data at rest
Typically more efficient (storage size and computationally) since the data is already parsed

Downsides of schema on write:

You have to plan ahead of time what your schema is before you store the data (i.e., you have to do ETL)
Typically you throw away the original data, which could be bad if you have a bug in your ingest process
It's harder to have different views of the same data

Benefits of schema on read:

Flexibility in defining how your data is interpreted at load time
- This gives you the ability to evolve your "schema" as time goes on
- This allows you to have different versions of your "schema"
- This allows the original source data format to change without having to consolidate to one data format
You get to keep your original data
You can load your data before you know what to do with it (so you don't drop it on the ground)
Gives you flexibility in being able to store unstructured, unclean, and/or unorganized data

Downsides of schema on read:

Generally it is less efficient because you have to reparse and reinterpret the data every time (this can be expensive with formats like XML)
The data is not self-documenting (i.e., you can't look at a schema to figure out what the data is)
More error prone and your analytics have to account for dirty data

200

answered Oct 23 '22 03:10

Donald Miner

Related questions
                            
                                How to choose between apache ranger and sentry
                            
                                how to use hadoop for a web application?
                            
                                Why does the Hadoop incompatible namespaceIDs issue happen?
                            
                                override log4j.properties in hadoop
                            
                                Hadoop: require root's password after enter "start-all.sh"
                            
                                Skipping the header while loading the text file using Piglatin
                            
                                copyFromLocal: `/user/hduser/gutenberg': No such file or directory
                            
                                HBase getting all timestamped values for a cell
                            
                                how to sort numerically in hadoop's shuffle/sort phase?
                            
                                Hadoop native libraries not found on OS/X
                            
                                Is there any Conditional IF like operator in Apache PIG?
                            
                                Python Connection to Hive
                            
                                How to read a .deflate file in hadoop
                            
                                Why we need Avro schema evolution
                            
                                Hive: Table creation with multi-files with multiple directories
                            
                                Hive throws: WstxParsingException: Illegal character entity: expansion character (code 0x8)
                            
                                NotSerializableException on anonymous class
                            
                                Why does "hadoop fs -mkdir" fail with Permission Denied?
                            
                                Sqoop Import --password-file function not working properly in sqoop 1.4.4
                            
                                Hadoop “Unable to load native-hadoop library for your platform” error on docker-spark?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With