
EDW Kimball vs Inmon

I've been tasked with coming up with a recommendation for how to proceed with an EDW and am looking for clarification on what I'm seeing. Everything I'm reading states that Kimball's approach will bring value to the business more quickly than Inmon's. I get that Kimball's approach is a dimensional model from the get-go, and that the different data marts (star schemas) are integrated through conformed dimensions... so in theory I can simply build the DM that solves the immediate business need and go on from there.

What I'm learning states that Inmon's model calls for an EDW designed in 3NF. The EDW is not defined by the source systems but by the structure of the business, the Corporate Information Factory (Orders, HR, etc.). Data from disparate systems maps into this structure. Once the data is in this form, ETLs are then created to produce the DMs.
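To make the contrast concrete, here is a minimal sketch of the two shapes, assuming an invented orders subject area (every table and column name below is hypothetical, not taken from any particular methodology text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Kimball-style data mart: one fact table at a declared grain,
# surrounded by denormalised, conformed dimensions.
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,  -- surrogate key
    customer_id   TEXT,                 -- business key from the source system
    customer_name TEXT,
    region        TEXT                  -- denormalised attribute
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,      -- e.g. 20161212
    full_date TEXT,
    month     TEXT,
    year      INTEGER
);
CREATE TABLE fact_orders (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    order_amount REAL                   -- additive measure at order-line grain
);
""")

# Inmon-style 3NF layer: normalised tables organised by business subject area,
# independent of any single source system.
conn.executescript("""
CREATE TABLE region (
    region_id   INTEGER PRIMARY KEY,
    region_name TEXT
);
CREATE TABLE customer (
    customer_id   TEXT PRIMARY KEY,
    customer_name TEXT,
    region_id     INTEGER REFERENCES region(region_id)
);
CREATE TABLE sales_order (
    order_id    TEXT PRIMARY KEY,
    customer_id TEXT REFERENCES customer(customer_id),
    order_date  TEXT
);
CREATE TABLE order_line (
    order_id TEXT REFERENCES sales_order(order_id),
    line_no  INTEGER,
    amount   REAL,
    PRIMARY KEY (order_id, line_no)
);
""")
```

The star side is shaped for querying at a declared grain; the 3NF side is shaped for integrating many sources, which is presumably where the extra design and ETL effort I'm asking about would go.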

Personally I feel Inmon's approach is the better way. I believe it will ensure the data is consistent, and it feels like you can do more with that data. What holds me back is that everything I'm reading says it will take much more time to deliver something, but I'm not seeing how that is true. From my narrow view, no matter what, the end result is a DM. Whether we use Kimball's or Inmon's approach, the end result is the same.

So then the question becomes: how do we get there? In Kimball's approach we create ETLs to some staging location and generally from there create a DM. In Inmon's approach I feel we just add another layer... that is, from the staging area we load the data into another database in 3NF, organized by business function, and build the DMs from that. What I'm missing is how this extra step adds so much time.
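As a rough sketch of what that extra layer means for the pipeline shape (the function names and row layouts are invented for illustration, and the surrogate keys are hard-coded where a real load would do lookups):

```python
# Hypothetical staged rows, standing in for a real staging table.
staging_rows = [
    {"order_id": "A1", "customer_id": "C1", "amount": 100.0, "order_date": "2016-12-12"},
]

def stage_to_mart(rows):
    """Kimball: transform staged source data straight into the star schema."""
    return [{"customer_key": 1, "date_key": 20161212, "order_amount": r["amount"]}
            for r in rows]

def stage_to_3nf(rows):
    """Inmon: first conform the data into normalised, subject-area tables."""
    return {
        "sales_order": [{"order_id": r["order_id"], "customer_id": r["customer_id"],
                         "order_date": r["order_date"]} for r in rows],
        "order_line": [{"order_id": r["order_id"], "line_no": 1, "amount": r["amount"]}
                       for r in rows],
    }

def three_nf_to_mart(edw):
    """Inmon: a second ETL step then builds each mart from the 3NF layer."""
    return [{"customer_key": 1, "date_key": 20161212, "order_amount": line["amount"]}
            for line in edw["order_line"]]

kimball_mart = stage_to_mart(staging_rows)                 # one hop
inmon_mart = three_nf_to_mart(stage_to_3nf(staging_rows))  # two hops
```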

I feel I can look at the end DM that needs to be built, map it back to a DW in 3NF, and then, as more DMs are requested, keep building up the 3NF DW with more and more data. However, if I create a DM with Kimball's model, that DM is built around the grain chosen for it; what if the next DM requested needs reporting at an even deeper grain? To me it feels like Kimball's methodology would take more work there, while with Inmon's it doesn't matter: I have everything at the transactional level, so when DMs of varying grains are requested I already have the data. I just ETL it into a DM, and all DMs will report consistently since they are sourced from the same data.
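A tiny illustration of that point, using made-up atomic rows: if the warehouse keeps the lowest-grain data, each new mart is just another roll-up of the same facts.

```python
from collections import defaultdict

# Hypothetical atomic-grain facts: one row per order line, the lowest level of detail.
atomic_facts = [
    {"order_date": "2016-12-12", "customer": "C1", "amount": 100.0},
    {"order_date": "2016-12-12", "customer": "C2", "amount": 40.0},
    {"order_date": "2016-12-13", "customer": "C1", "amount": 60.0},
]

def build_mart(facts, grain):
    """Roll the same atomic data up to whatever grain a new mart asks for."""
    totals = defaultdict(float)
    for fact in facts:
        totals[grain(fact)] += fact["amount"]
    return dict(totals)

# A daily mart and a monthly mart, both sourced from the same atomic rows,
# so they report consistently with each other.
daily_mart = build_mart(atomic_facts, grain=lambda f: f["order_date"])
monthly_mart = build_mart(atomic_facts, grain=lambda f: f["order_date"][:7])

print(daily_mart)    # {'2016-12-12': 140.0, '2016-12-13': 60.0}
print(monthly_mart)  # {'2016-12': 200.0}
```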

I dunno... just looking for others' views. Everything I read says Kimball's is quicker... I say sure, maybe a little bit, but there is certainly a cost to going the quicker route. And for the sake of argument... let's say it takes a week to get a DM up and running with Kimball's methodology... to me it feels like it should only take 10%, maybe 20%, longer using Inmon's.

If anyone has real-world experience with the different models, and whether one really takes so much longer than the other... please share. Or if I have this backwards, tell me that too!

asked Dec 12 '16 by user3776554



2 Answers

For context: I look after a 3-billion-record data warehouse for a large multinational. Our data makes its way from the various source systems through staging and into a 3NF db. From there our ELT processes move the data into a dimensionally modelled, star schema db.

If I could start again I would definitely drop the 3NF step. When I first built that layer I thought it would add real value. I felt sure that normalisation would protect the integrity of my data. I was equally confident the 3NF db would be the best place to run large/complex queries.

But in practice, it has slowed our development. Most changes require an update to the staging, 3NF, and star schema databases.

The extra layer also increases the amount of time it takes to publish our data. The additional transformations, checks and reconciliations all add up.

The promised improvement in integrity never materialised. I realise now that because I control the ETL, and the validation processes within, I can ensure my data is both denormalised and accurate. In reporting data we control every cell in every table. The more I think about that, the more I see it as a real opportunity.

Large and complex queries were another myth busted by experience. I now see the need to write complex reporting queries as a failing of my star db. When that occurs I always ask myself: why isn't this question easy to answer? The answer is most often bad table design. The heavy lifting is best carried out when transforming the data.
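A small illustration of that last point (the column names and the margin calculation are made up): derive the awkward measure once, during the transform, so the reporting query has no complex logic left to get wrong.

```python
# Hypothetical transform step: pre-compute the derived measure during ETL.
staged = [
    {"order_id": "A1", "revenue": 100.0, "cost": 62.5},
    {"order_id": "A2", "revenue": 40.0, "cost": 31.0},
]

fact_rows = [
    {**row, "margin": row["revenue"] - row["cost"]}  # heavy lifting done here
    for row in staged
]

# Report side: a plain aggregation over a column that already exists.
total_margin = sum(r["margin"] for r in fact_rows)
print(total_margin)  # 46.5
```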

Running both a 3NF and a star also creates an opportunity for the two systems to disagree. When this happens it is often a very subtle difference. Neither is wrong, per se. Instead, it is possible the 3NF and star queries are asking slightly different questions, and therefore returning different results. Although technically correct, this can be hard to explain. Even minor, explainable differences can erode confidence over time.

In defence of our 3NF db, it does make loading into the star easier. But I would happily trade more complex SSIS packages for one less layer.

Having said all of this: it is very hard to recommend an approach to anyone without a deep understanding of their systems, requirements, culture, skills, etc. Having read your question I am sure you have wrestled with all these issues, and many more no doubt! In the end, only you can decide what the best approach for your situation is. Once you've made your mind up, stick with it. Consistency, clarity and a well-defined methodology are more important than anything else.

answered Oct 15 '22 by David Rushton


Dimensions and measures are a well-proven method for presenting and simplifying data for end users.

If you present an end user with a schema based on the source system (3NF) versus a dimensionally modelled star schema (Kimball), they will be able to make much more sense of the dimensionally modelled one.

I've never really looked into an Inmon decision support system, but to me it seems to be just the ODS portion of a full data warehouse.

You are right in saying "the EDW is not defined by the source systems but by the structure of the business". A star schema reflects this, but an ODS (a copy of the source system) doesn't.

A star schema takes longer to build than just an ODS, but gives many benefits, including:

  • Slowly changing dimensions can track changes over time (see the sketch after this list)
  • Denormalisation simplifies joins and improves performance
  • Surrogate keys allow you to disconnect from the source systems
  • Conformed dimensions let you report across business units (e.g. profit per headcount)
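For example, here is a minimal sketch of how a Type 2 slowly changing dimension keyed by surrogate keys might be maintained (the row layout and the apply_scd2 helper are hypothetical, just to show the mechanics):

```python
from datetime import date

# Hypothetical Type 2 SCD: each change gets a new surrogate key and validity window,
# so facts keep pointing at the version that was current when they occurred.
dim_customer = [
    {"customer_key": 1, "customer_id": "C1", "region": "North",
     "valid_from": date(2016, 1, 1), "valid_to": None, "is_current": True},
]
next_key = 2

def apply_scd2(dim, customer_id, new_region, effective):
    """Close the current row and insert a new version with a fresh surrogate key."""
    global next_key
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = effective
            row["is_current"] = False
    dim.append({"customer_key": next_key, "customer_id": customer_id,
                "region": new_region, "valid_from": effective,
                "valid_to": None, "is_current": True})
    next_key += 1

# Customer C1 moves region; history is preserved rather than overwritten, and the
# source system's business key (customer_id) stays decoupled from the surrogate
# keys the facts join to.
apply_scd2(dim_customer, "C1", "South", date(2016, 12, 12))
```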

If your Inmon 3NF database is not just an ODS (replica of source systems), but some kind of actual business model then you have two layers to model: the 3NF layer and the star schema layer.

It's difficult nowadays to sell the benefit of even one layer of data modelling when everyone thinks they can just do it all in a 'self service' tool (which I believe is a fallacy). Your system should be no more complicated than it needs to be, because all that complexity adds up to maintenance, and that's the real issue: introducing changes 12 months into the build when you have to change many layers.

To paraphrase @destination-data: your source-system-to-star-schema transformation (and separation) is already achieved through ETL, so the 3NF layer seems redundant to me. You design your star schema to be independent of the source systems by correctly implementing surrogate keys and business keys, and by modelling it on the business, not on the source system.

answered Oct 15 '22 by Nick.McDermaid