
Is star schema still necessary for a big-data-warehouse?

I am designing a new Hadoop-based data warehouse using Hive, and I was wondering whether the classic star/snowflake schemas are still a "standard" in this context.

Big Data systems embrace redundancy, so fully normalized schemas usually perform poorly (for example, in NoSQL databases like HBase or Cassandra).

Is it still a best practice to build star-schema data warehouses with Hive?

Or is it better to design wide, denormalized (redundant) tables that exploit the new columnar file formats? A rough sketch of both alternatives follows.
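
For concreteness, here is a sketch of the two designs I have in mind, using a made-up sales example (all table and column names are just illustrative):

    -- Option 1: classic star schema -- a narrow fact table plus dimensions.
    CREATE TABLE dim_product (
      product_id   BIGINT,
      product_name STRING,
      category     STRING
    )
    STORED AS ORC;

    CREATE TABLE fact_sales (
      sale_id    BIGINT,
      sale_date  STRING,
      product_id BIGINT,          -- foreign key into dim_product
      quantity   INT,
      amount     DECIMAL(10,2)
    )
    STORED AS ORC;

    -- Option 2: one wide, denormalized table. Dimension attributes are
    -- repeated on every row, which columnar formats compress well.
    CREATE TABLE sales_wide (
      sale_id      BIGINT,
      sale_date    STRING,
      product_name STRING,
      category     STRING,
      quantity     INT,
      amount       DECIMAL(10,2)
    )
    STORED AS ORC;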

asked Jun 13 '15 by Nicola Ferraro




2 Answers

Joins are evil, in particular on Hadoop, where we can't guarantee data co-locality, especially when we need to join two large tables. This is one of the differences between Hadoop and a traditional MPP such as Teradata, Greenplum, etc. In an MPP I distribute my data evenly across all nodes in the cluster based on a hashed key. The relevant rows of the order and order_item tables would then end up on the same nodes, which at least eliminates data transfer across the network. On Hadoop you would instead nest the order_item data inside the order table, which eliminates the need for the join altogether.
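
As a minimal sketch of that nesting idea (assuming hypothetical orders and order_items source tables), you can model the child rows as an array of structs and pre-join once at load time:

    -- Nested fact table: the order items live inside the order row.
    CREATE TABLE orders_nested (
      order_id   BIGINT,
      order_date STRING,
      items      ARRAY<STRUCT<item_id:BIGINT, sku:STRING, qty:INT, price:DECIMAL(10,2)>>
    )
    STORED AS ORC;

    -- Pay the join once at load time instead of on every query.
    INSERT OVERWRITE TABLE orders_nested
    SELECT o.order_id,
           o.order_date,
           collect_list(named_struct('item_id', i.item_id,
                                     'sku',     i.sku,
                                     'qty',     i.qty,
                                     'price',   i.price))
    FROM orders o
    JOIN order_items i ON o.order_id = i.order_id
    GROUP BY o.order_id, o.order_date;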

If, on the other hand, you have a small lookup/dimension table and a large fact table, you can broadcast the small table to all nodes in your cluster, thereby eliminating the need for network transfer of the large table.
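
Hive supports this directly as a map-side (broadcast) join. The settings and hint below are illustrative (exact behaviour depends on your Hive version), reusing the hypothetical fact/dimension tables from the question:

    -- Broadcast the small dimension table to every node; the large fact
    -- table is then joined locally, with no shuffle across the network.
    SET hive.auto.convert.join=true;                -- convert eligible joins automatically
    SET hive.mapjoin.smalltable.filesize=25000000;  -- "small" threshold, in bytes

    SELECT /*+ MAPJOIN(d) */                        -- older, explicit form of the hint
           d.category,
           SUM(f.amount) AS revenue
    FROM   fact_sales f
    JOIN   dim_product d ON f.product_id = d.product_id
    GROUP BY d.category;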

In summary, star schemas are still relevant, but mostly from a logical modelling point of view. Physically, you may be better off denormalizing even further to create one big columnar, compressed, nested fact table.
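
Queries can still get item-level rows back out of such a nested table when they need them, for example with LATERAL VIEW explode (continuing the hypothetical orders_nested sketch above):

    -- Explode the nested items array back into one row per order item.
    SELECT o.order_id,
           item.sku,
           item.qty * item.price AS line_total
    FROM   orders_nested o
    LATERAL VIEW explode(o.items) t AS item;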

I have written up a full blog post that discusses the purpose and usefulness of dimensional models on Hadoop and Big Data technologies.

answered Nov 10 '22 by Uli Bethke


When designing for NoSQL databases, you tend to optimize for a specific query by preprocessing parts of it and thus storing a denormalized copy of the data (albeit denormalized in a query-specific way).
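
In Hive terms, a query-specific denormalization might look like a precomputed table built for one dashboard query. A hypothetical sketch, reusing the sales tables from the question:

    -- Precompute exactly the aggregate one query needs; serving it is a scan.
    CREATE TABLE daily_category_revenue
    STORED AS ORC
    AS
    SELECT f.sale_date,
           d.category,
           SUM(f.amount) AS revenue
    FROM   fact_sales f
    JOIN   dim_product d ON f.product_id = d.product_id
    GROUP BY f.sale_date, d.category;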

The star schema, on the other hand, is an all-purpose denormalization that's usually appropriate.

When you're planning on using Hive, you're really not using it for query-specific optimization but for the general-purpose nature of SQL, and as such I'd imagine the star schema is still appropriate. For a NoSQL DB with a non-SQL interface, however, I'd suggest a more query-specific design.
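
The payoff of that all-purpose denormalization is that the same tables answer ad-hoc questions nobody planned for. A sketch against the hypothetical tables above:

    -- An ad-hoc slice a query-specific table would not cover:
    -- top products by units sold rather than by revenue.
    SELECT d.product_name,
           SUM(f.quantity) AS units_sold
    FROM   fact_sales f
    JOIN   dim_product d ON f.product_id = d.product_id
    GROUP BY d.product_name
    ORDER BY units_sold DESC
    LIMIT 10;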

answered Nov 10 '22 by Chris Gerken