We are working on a datawarehouse for a bank and have pretty much followed the standard Kimball model of staging tables, a star schema and an ETL to pull the data through the process.
Kimball talks about using the staging area for import, cleaning, processing and everything until you are ready to put the data into the star schema. In practice this typically means uploading data from the sources into a set of tables with little or no modification, followed by taking data optionally through intermediate tables until it is ready to go into the star schema. That's a lot of work for a single entity, no single responsibility here.
Previous systems I have worked on have made a distinction between the different sets of tables, to the extent of having:
You can stick these in separate schemas and then apply differing policies for archive/backup/security etc. One of the other guys has worked on a warehouse where there is a StagingInput and a StagingOutput, similar story. The team as a whole has a lot of experience, both datawarehouse and otherwise.
However, despite all this, looking through Kimball and the web there seems to be absolutely nothing in writing about giving any kind of structure to the staging database. One would be forgiven for believing that Mr Kimball would have us all work with staging as this big deep dark unstructured pool of data.
Whilst of course it is pretty obvious how to go about it if we want to add some more structure to the staging area, it seems very odd that there seems to be nothing written about it.
So, what is everyone else out there doing? Is staging just this big unstructured mess or do folk have some interesting designs on it?
Map the data from its original form into a data model that is suitable for manipulation at the staging area. Validate and clean the data. Apply any transformations to the data that are required before the data sets are loaded into the repository. Map the data from its staging area model to its loading model.
A typical data warehouse has four main components: a central database, ETL (extract, transform, load) tools, metadata, and access tools. All of these components are engineered for speed so that you can get results quickly and analyze data on the fly.
A staging layer provides a closed off area for the loading and processing of source data, and is used as a workspace for subsequently applying transformations including complex calculations, data cleansing, and change data capture, before the data is loaded to the data warehouse for analysis.
I have experienced the same problem. We have a large HR DataWarehouse and I am pulling data from systems all over the enterprise. I've got a nice collection of Fact and Dimension tables, but the staging area is a mess. I don't know of any standards for design of this. I would follow the same path you are on and come up with a standard set of names to keep things in order. Your suggestion is pretty good for the naming. I'd keep working with that.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With