What do these two statements, which I encountered during a lecture, mean, and what is the difference between them?
1. Traditional databases enforce schema during load time.
and
2. Hive enforces schema during read time.
With Hive, we have schema on read, which means the data is not verified before loading but rather when a query is issued. The initial load is therefore very fast, because the data is not read at load time.
The Hadoop Distributed File System is a classic example of a schema-on-read system.
Hive supports schema on read, which means data is checked against the schema when a query is issued on it. This is similar to the HDFS write operation, where data is written to HDFS in a distributed fashion without being checked, because we cannot check a huge amount of data at write time.
There is no default schema in Hive. In order to query data in Hive, you first have to create a table describing the content of your data (by using create external table ... location ). So you basically have to tell Hive the "schema" before querying the data.
You touch on one of the reasons why Hadoop and other NoSQL strategies have been so successful, so I'm not sure if you were expecting to get a dissertation or not, but here it is! The extra flexibility and agility in data analysis have probably contributed to the explosion of "data science", just because they make large-scale data analysis easier in general.
A traditional relational database stores the data with schema in mind. It knows that the second column is an integer, it knows that it has 40 columns, etc. Therefore, you need to specify your schema ahead of time and have it well planned out. This is "schema on write" -- that is, the schema is applied when the data is being written to the data store.
Hive (in some cases), Hadoop, and many other NoSQL systems in general are about "schema on read" -- the schema is applied as the data is being read off of the data store. Consider the following line of raw text:
A:B:C~E:F~G:H~~I::J~K~L
There are a couple ways to interpret this. ~ could be the delimiter or maybe : could be the delimiter. Who knows? With schema on read, it doesn't matter. You decide what the schema is when you analyze the data, not when you write the data. This example is a bit ridiculous in that you probably won't ever encounter this case, but it gets the point across hopefully.
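To make the two interpretations concrete, here is a minimal Python sketch that reads the same raw line under both delimiter choices. The helper name parse and its record_sep / field_sep parameters are made up for illustration; nothing here is Hive's actual parsing code.

```python
# The raw line from the example above, stored exactly as written.
RAW_LINE = "A:B:C~E:F~G:H~~I::J~K~L"


def parse(line, record_sep, field_sep):
    """Split a raw line into records, then split each record into fields.

    Hypothetical helper: the "schema" is just the pair of separators,
    and it is chosen by the reader, not by whoever wrote the data.
    """
    return [record.split(field_sep) for record in line.split(record_sep)]


# Interpretation 1: "~" separates records, ":" separates fields.
tilde_records = parse(RAW_LINE, record_sep="~", field_sep=":")

# Interpretation 2: ":" separates records, "~" separates fields.
colon_records = parse(RAW_LINE, record_sep=":", field_sep="~")

print(tilde_records[0])   # ['A', 'B', 'C']
print(colon_records[-1])  # ['J', 'K', 'L']
```

The stored bytes never change between the two calls; only the interpretation applied at read time does.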
With schema on read, you just load your data into the data store and think about how to parse and interpret it later. At the core of this explanation, schema on read means write your data first, figure out what it is later. Schema on write means figure out what your data is first, then write it after.
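The contrast can be sketched with two toy in-memory stores in Python. Everything here is an assumption for illustration: the two-column SCHEMA, the comma-separated input, the class names, and the choice to return None for a field that does not fit at read time. It is not how any particular database or Hive is implemented.

```python
# Hypothetical two-column schema: (column name, type to cast to).
SCHEMA = (("name", str), ("age", int))


def apply_schema(raw_line, strict):
    """Interpret a comma-separated line according to SCHEMA.

    strict=True  -> raise ValueError on any mismatch (reject the data).
    strict=False -> put None in fields that are missing or do not parse.
    """
    parts = raw_line.split(",")
    if strict and len(parts) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} fields, got {len(parts)}")
    row = {}
    for index, (name, cast) in enumerate(SCHEMA):
        try:
            row[name] = cast(parts[index])
        except (ValueError, IndexError):
            if strict:
                raise
            row[name] = None
    return row


class SchemaOnWriteStore:
    """Validates on write: bad data never gets in, but every load pays for parsing."""

    def __init__(self):
        self.rows = []

    def write(self, raw_line):
        self.rows.append(apply_schema(raw_line, strict=True))

    def read(self):
        return list(self.rows)


class SchemaOnReadStore:
    """Stores raw lines untouched: loading is cheap, interpretation happens per query."""

    def __init__(self):
        self.lines = []

    def write(self, raw_line):
        self.lines.append(raw_line)  # no parsing, no validation

    def read(self):
        return [apply_schema(line, strict=False) for line in self.lines]


write_store = SchemaOnWriteStore()
read_store = SchemaOnReadStore()
for line in ["alice,30", "bob,unknown"]:
    read_store.write(line)  # always succeeds
    try:
        write_store.write(line)
    except ValueError:
        pass  # "bob,unknown" is rejected at load time

print(write_store.read())  # [{'name': 'alice', 'age': 30}]
print(read_store.read())   # [{'name': 'alice', 'age': 30}, {'name': 'bob', 'age': None}]
```

The schema-on-write store finds the bad row at load time and refuses it, while the schema-on-read store accepts everything and only reveals the problem when the data is queried.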
There is a tradeoff here. Some of these are subjective and my own opinion.
Benefits of schema on write:
Downsides of schema on write:
Benefits of schema on read:
Downsides of schema on read: