In our application, we support user-written plugins.
Those plugins generate data of various types (int, float, str, or datetime), and those data are labeled with bunches of meta-data (user, current directory, etc.) as well as three free-text fields (MetricName, Var1, Var2) .
Now we have several years of this data, and I'm trying to design a schema which allows very fast access to those metrics in an analytical fashion (charts and stuff). This is easy as long as there are only a few metrics we're interested in, but we have a large number of different metrics at different granularities, and we'd like to store user-added data to allow for later analysis (possibly after a schema change).
Example data: (please keep in mind this is very simplified)
=========================================================================================================
| BaseDir | User | TrialNo | Project | ... | MetricValue | MetricName | Var1 | Var2 |
=========================================================================================================
| /path/to/me | me | 0 | domino | ... | 20 | Errors | core | dumb |
| /path/to/me | me | 0 | domino | ... | 98.6 | Tempuratur | body | |
| /some/other/pwd | oneguy | 223 | farq | ... | 443 | ManMonths | waste | Mythical |
| /some/other/pwd | oneguy | 224 | farq | ... | 0 | Albedo | nose | PolarBear |
| /path/to/me | me | 0 | domino | ... | 70.2 | Tempuratur | room | |
| /path/to/me2 | me | 2 | domino | ... | 2020 | Errors | misc | filtered |
Anyone can add a parser plugin to start measuring a AirSpeed metric, and we'd like our analisys tools to "just work" on that new metric.
Update:
Considering that many of the MetricName's are well-known beforehand, I can satisfy my requirements if I can enable analysis on those metrics, and simply store the other user-added metrics. We can accept the fact that new metrics won't be available for heavy-duty analysis without an edit to the schema.
What do you guys think of this solution?
I've divided our metrics into three fact tables, one for facts that don't need a MetricTopic, one for ones that do, and one for all the other metrics, including unexpected ones.
For the bounty:
I'll accept any critique which shows how to make this system more functional, or brings it into closer alignment with industry best-practices. References to literature gives added weight.
The three main types of data warehouses are enterprise data warehouse (EDW), operational data store (ODS), and data mart.
A typical data warehouse has four main components: a central database, ETL (extract, transform, load) tools, metadata, and access tools.
4.1. 5 Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse. From the architecture point of view, there are three data warehouse models: the enterprise warehouse, the data mart, and the virtual warehouse.
A data warehouse is a central repository of information that can be analyzed to make more informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence.
If I understand correctly, you are looking for a schema to support on-fly creation of measures in a DW. In a classical data warehouse each measure is a column, so in a Kimball star you would need to add a column for each new measure -- change the schema.
What you have is an EAV model, and analytics on EAV is not easy and not fast -- take a look at this discussion.
I would suggest you look at tools like splunk, which is suited for theis type of problems.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With