
Is star schema still necessary for a big-data-warehouse?

I am designing a new Hadoop-based data warehouse using Hive, and I was wondering whether the classic star/snowflake schemas are still a "standard" in this context.

Big Data systems embrace redundancy, so fully normalized schemas usually perform poorly (for example, in NoSQL databases like HBase or Cassandra).

Is it still a best practice to build star-schema data warehouses with Hive?

Or is it better to design wide, denormalized (redundant) tables that exploit the new columnar file formats? A rough sketch of both alternatives follows.
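
For concreteness, here is a sketch of the two designs I have in mind, using a made-up sales example (all table and column names are just illustrative):

    -- Option 1: classic star schema -- a narrow fact table plus dimensions.
    CREATE TABLE dim_product (
      product_id   BIGINT,
      product_name STRING,
      category     STRING
    )
    STORED AS ORC;

    CREATE TABLE fact_sales (
      sale_id    BIGINT,
      sale_date  STRING,
      product_id BIGINT,          -- foreign key into dim_product
      quantity   INT,
      amount     DECIMAL(10,2)
    )
    STORED AS ORC;

    -- Option 2: one wide, denormalized table. Dimension attributes are
    -- repeated on every row, which columnar formats compress well.
    CREATE TABLE sales_wide (
      sale_id      BIGINT,
      sale_date    STRING,
      product_name STRING,
      category     STRING,
      quantity     INT,
      amount       DECIMAL(10,2)
    )
    STORED AS ORC;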

asked Jun 13 '15 by Nicola Ferraro




2 Answers

Joins are evil, in particular on Hadoop, where we can't guarantee data co-locality, especially when we need to join two large tables. This is one of the differences between Hadoop and a traditional MPP such as Teradata, Greenplum, etc. In an MPP I distribute my data evenly across all nodes in the cluster based on a hashed key. The relevant rows of the order and order_item tables would then end up on the same nodes, which at least eliminates data transfer across the network. On Hadoop you would instead nest the order_item data inside the order table, which eliminates the need for the join altogether.
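
As a minimal sketch of that nesting idea (assuming hypothetical orders and order_items source tables), you can model the child rows as an array of structs and pre-join once at load time:

    -- Nested fact table: the order items live inside the order row.
    CREATE TABLE orders_nested (
      order_id   BIGINT,
      order_date STRING,
      items      ARRAY<STRUCT<item_id:BIGINT, sku:STRING, qty:INT, price:DECIMAL(10,2)>>
    )
    STORED AS ORC;

    -- Pay the join once at load time instead of on every query.
    INSERT OVERWRITE TABLE orders_nested
    SELECT o.order_id,
           o.order_date,
           collect_list(named_struct('item_id', i.item_id,
                                     'sku',     i.sku,
                                     'qty',     i.qty,
                                     'price',   i.price))
    FROM orders o
    JOIN order_items i ON o.order_id = i.order_id
    GROUP BY o.order_id, o.order_date;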

If, on the other hand, you have a small lookup/dimension table and a large fact table, you can broadcast the small table to all nodes in your cluster, thereby eliminating the need for network transfer of the large table.
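
Hive supports this directly as a map-side (broadcast) join. The settings and hint below are illustrative (exact behaviour depends on your Hive version), reusing the hypothetical fact/dimension tables from the question:

    -- Broadcast the small dimension table to every node; the large fact
    -- table is then joined locally, with no shuffle across the network.
    SET hive.auto.convert.join=true;                -- convert eligible joins automatically
    SET hive.mapjoin.smalltable.filesize=25000000;  -- "small" threshold, in bytes

    SELECT /*+ MAPJOIN(d) */                        -- older, explicit form of the hint
           d.category,
           SUM(f.amount) AS revenue
    FROM   fact_sales f
    JOIN   dim_product d ON f.product_id = d.product_id
    GROUP BY d.category;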

In summary, star schemas are still relevant, but mostly from a logical modelling point of view. Physically, you may be better off denormalizing even further to create one big columnar, compressed, nested fact table.
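
Queries can still get item-level rows back out of such a nested table when they need them, for example with LATERAL VIEW explode (continuing the hypothetical orders_nested sketch above):

    -- Explode the nested items array back into one row per order item.
    SELECT o.order_id,
           item.sku,
           item.qty * item.price AS line_total
    FROM   orders_nested o
    LATERAL VIEW explode(o.items) t AS item;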

I have written up a full blog post that discusses the purpose and usefulness of dimensional models on Hadoop and Big Data technologies.

answered Nov 10 '22 by Uli Bethke


When designing for NoSQL databases, you tend to optimize for a specific query by preprocessing parts of it and thus storing a denormalized copy of the data (albeit denormalized in a query-specific way).
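
In Hive terms, a query-specific denormalization might look like a precomputed table built for one dashboard query. A hypothetical sketch, reusing the sales tables from the question:

    -- Precompute exactly the aggregate one query needs; serving it is a scan.
    CREATE TABLE daily_category_revenue
    STORED AS ORC
    AS
    SELECT f.sale_date,
           d.category,
           SUM(f.amount) AS revenue
    FROM   fact_sales f
    JOIN   dim_product d ON f.product_id = d.product_id
    GROUP BY f.sale_date, d.category;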

The star schema, on the other hand, is an all-purpose denormalization that's usually appropriate.

When you're planning on using Hive, you're really not using it for query-specific optimization but for the general-purpose nature of SQL, and as such I'd imagine the star schema is still appropriate. For a NoSQL DB with a non-SQL interface, however, I'd suggest a more query-specific design.
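
The payoff of that all-purpose denormalization is that the same tables answer ad-hoc questions nobody planned for. A sketch against the hypothetical tables above:

    -- An ad-hoc slice a query-specific table would not cover:
    -- top products by units sold rather than by revenue.
    SELECT d.product_name,
           SUM(f.quantity) AS units_sold
    FROM   fact_sales f
    JOIN   dim_product d ON f.product_id = d.product_id
    GROUP BY d.product_name
    ORDER BY units_sold DESC
    LIMIT 10;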

answered Nov 10 '22 by Chris Gerken