Generating a star schema in Hive

I come from the SQL data warehouse world, where I generate dimension and fact tables from a flat feed. In a typical data warehouse project we split the feed into facts and dimensions, for example a sales feed into customer, store, and product dimensions plus a sales fact.


I am completely new to Hadoop, and I have learned that I can build a data warehouse in Hive. I am familiar with using GUIDs, which I think are applicable as primary keys in Hive. So, is the strategy below the right way to load fact and dimension tables in Hive?

  1. Load the source data into a Hive table, let's say Sales_Data_Warehouse.
  2. Generate the dimensions from Sales_Data_Warehouse, e.g.:

    SELECT New_Guid(), Customer_Name, Customer_Address FROM Sales_Data_Warehouse

  3. When all dimensions are done, load the fact table like:

    SELECT New_Guid() AS Fact_Key, Customer.Customer_Key, Store.Store_Key ...
    FROM Sales_Data_Warehouse source
    JOIN Customer_Dimension Customer
      ON source.Customer_Name = Customer.Customer_Name
     AND source.Customer_Address = Customer.Customer_Address
    JOIN Store_Dimension Store
      ON Store.Store_Name = source.Store_Name
    JOIN Product_Dimension Product
      ON .....

Is this the way I should load my fact and dimension tables in Hive?
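
To make the idea concrete, here is a rough HiveQL sketch of step 2. I am guessing that reflect('java.util.UUID', 'randomUUID') can stand in for New_Guid(), but I am not sure whether that is the idiomatic approach in Hive:

    -- Guessed HiveQL for step 2; reflect() calls a static Java method,
    -- which I believe can produce a random UUID string per row.
    INSERT INTO TABLE Customer_Dimension
    SELECT
      reflect('java.util.UUID', 'randomUUID') AS Customer_Key,
      Customer_Name,
      Customer_Address
    FROM (
      SELECT DISTINCT Customer_Name, Customer_Address
      FROM Sales_Data_Warehouse
    ) src;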

Also, in typical warehouse projects we need to update dimension attributes (e.g. Customer_Address changes to something else) or, more rarely, update a fact table's foreign key. How can I do an INSERT-or-UPDATE load in Hive (like a Lookup in SSIS or a MERGE statement in T-SQL)?
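
For context, this is roughly what I do today in T-SQL for a type-1 dimension update (table and column names are made up for illustration):

    -- Illustrative T-SQL MERGE: update changed customer addresses,
    -- insert customers that are not in the dimension yet.
    MERGE Customer_Dimension AS target
    USING Staging_Customers AS source
        ON target.Customer_Name = source.Customer_Name
    WHEN MATCHED AND target.Customer_Address <> source.Customer_Address THEN
        UPDATE SET target.Customer_Address = source.Customer_Address
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Customer_Key, Customer_Name, Customer_Address)
        VALUES (NEWID(), source.Customer_Name, source.Customer_Address);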

asked Mar 28 '17 by Zerotoinfinity

People also ask

What is a star schema?

A star schema is a database organizational structure optimized for use in a data warehouse or business intelligence that uses a single large fact table to store transactional or measured data, and one or more smaller dimensional tables that store attributes about the data.

What is the purpose of a star schema?

The purpose of a star schema is to cull out numerical "fact" data relating to a business and separate it from the descriptive, or “dimensional" data. Fact data will include information like price, weight, speed, and quantities—i.e., data in a numerical format.

Does Hive have schema?

Hive supports the ANSI-standard information_schema database, which you can query for information about tables, views, columns, and your Hive privileges.
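
As a minimal sketch, assuming the ANSI-standard information_schema layout described above, such a query might look like this (verify the exact tables and columns against your Hive version):

    -- List the tables in the default database via information_schema
    SELECT table_schema, table_name
    FROM information_schema.tables
    WHERE table_schema = 'default';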


1 Answer

We still get the benefits of dimensional models on Hadoop and Hive. However, some features of Hadoop require us to slightly adapt the standard approach to dimensional modelling.

The Hadoop file system is immutable: we can only add data, not update it in place. As a result we can only append records to dimension tables (while Hive has added an update feature and transactions, these still seem rather buggy). Slowly Changing Dimensions therefore become the default behaviour on Hadoop. To get the latest, most up-to-date record in a dimension table we have three options. First, we can create a view that retrieves the latest record using windowing functions. Second, we can have a compaction service running in the background that recreates the latest state. Third, we can store our dimension tables in mutable storage, e.g. HBase, and federate queries across the two types of storage.
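
As a sketch of the first option, a view over an append-only dimension table can use ROW_NUMBER() to expose only the latest version of each record (table and column names here are illustrative, not from the question):

    -- Illustrative Hive view: keep only the most recent row per business key
    -- in an append-only customer dimension.
    CREATE VIEW customer_dim_current AS
    SELECT customer_key, customer_name, customer_address
    FROM (
      SELECT
        customer_key,
        customer_name,
        customer_address,
        ROW_NUMBER() OVER (
          PARTITION BY customer_name    -- business key of the dimension
          ORDER BY load_ts DESC         -- newest record first
        ) AS rn
      FROM customer_dim_history
    ) t
    WHERE rn = 1;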

The way data is distributed across HDFS makes it expensive to join data. In a distributed relational database (MPP) we can co-locate records with the same primary and foreign keys on the same node in a cluster, which makes it relatively cheap to join very large tables: no data needs to travel across the network to perform the join. This is very different on Hadoop and HDFS. On HDFS, tables are split into big chunks and distributed across the nodes of the cluster, and we have no control over how individual records and their keys are spread across it. As a result, joins between two very large tables on Hadoop are quite expensive, because data has to travel across the network.

We should avoid joins where possible. For a large fact table and a dimension table we can de-normalise the dimension directly into the fact table. For two very large transaction tables we can nest the records of the child table inside the parent table and flatten the data out at run time. We can use SQL extensions such as array_agg in BigQuery/Postgres etc. to handle multiple grains in a fact table.
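
As a rough Hive sketch of the nesting approach, collect_list can play the role of array_agg; the table and column names below are illustrative only:

    -- Nest order line items inside the parent order as an array of structs.
    CREATE TABLE orders_nested AS
    SELECT
      o.order_id,
      o.customer_name,
      collect_list(
        named_struct('product_id', l.product_id, 'quantity', l.quantity)
      ) AS line_items
    FROM orders o
    JOIN order_lines l
      ON o.order_id = l.order_id
    GROUP BY o.order_id, o.customer_name;

    -- Flatten the nested records back out at query time.
    SELECT o.order_id, li.product_id, li.quantity
    FROM orders_nested o
    LATERAL VIEW explode(o.line_items) lines AS li;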

I would also question the usefulness of surrogate keys. Why not use the natural key? Performance with complex compound keys may be an issue, but otherwise surrogate keys are not really useful, and I never use them.

answered Oct 12 '22 by Uli Bethke