Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data Warehousing arbitrary fields

In our application, we support user-written plugins.

Those plugins generate data of various types (int, float, str, or datetime), and those data are labeled with bunches of meta-data (user, current directory, etc.) as well as three free-text fields (MetricName, Var1, Var2) .

Now we have several years of this data, and I'm trying to design a schema which allows very fast access to those metrics in an analytical fashion (charts and stuff). This is easy as long as there are only a few metrics we're interested in, but we have a large number of different metrics at different granularities, and we'd like to store user-added data to allow for later analysis (possibly after a schema change).

Example data: (please keep in mind this is very simplified)

=========================================================================================================
| BaseDir         | User    | TrialNo | Project | ... | MetricValue | MetricName | Var1 | Var2      |
=========================================================================================================
| /path/to/me     | me      | 0       | domino  | ... | 20          | Errors     | core  | dumb      |
| /path/to/me     | me      | 0       | domino  | ... | 98.6        | Tempuratur | body  |           |
| /some/other/pwd | oneguy  | 223     | farq    | ... | 443         | ManMonths  | waste | Mythical  |
| /some/other/pwd | oneguy  | 224     | farq    | ... | 0           | Albedo     | nose  | PolarBear |
| /path/to/me     | me      | 0       | domino  | ... | 70.2        | Tempuratur | room  |           |
| /path/to/me2    | me      | 2       | domino  | ... | 2020        | Errors     | misc  | filtered  |

Anyone can add a parser plugin to start measuring a AirSpeed metric, and we'd like our analisys tools to "just work" on that new metric.


Update:

Considering that many of the MetricName's are well-known beforehand, I can satisfy my requirements if I can enable analysis on those metrics, and simply store the other user-added metrics. We can accept the fact that new metrics won't be available for heavy-duty analysis without an edit to the schema.

What do you guys think of this solution?

I've divided our metrics into three fact tables, one for facts that don't need a MetricTopic, one for ones that do, and one for all the other metrics, including unexpected ones.

Metrics Schema #3


For the bounty:

I'll accept any critique which shows how to make this system more functional, or brings it into closer alignment with industry best-practices. References to literature gives added weight.

like image 802
bukzor Avatar asked Sep 15 '10 23:09

bukzor


People also ask

What are the different types of data warehousing?

The three main types of data warehouses are enterprise data warehouse (EDW), operational data store (ODS), and data mart.

What are the basic elements of data warehousing?

A typical data warehouse has four main components: a central database, ETL (extract, transform, load) tools, metadata, and access tools.

What are the three data warehouse models?

4.1. 5 Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse. From the architecture point of view, there are three data warehouse models: the enterprise warehouse, the data mart, and the virtual warehouse.

What are data warehousing concepts?

A data warehouse is a central repository of information that can be analyzed to make more informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence.


1 Answers

If I understand correctly, you are looking for a schema to support on-fly creation of measures in a DW. In a classical data warehouse each measure is a column, so in a Kimball star you would need to add a column for each new measure -- change the schema.

What you have is an EAV model, and analytics on EAV is not easy and not fast -- take a look at this discussion.

I would suggest you look at tools like splunk, which is suited for theis type of problems.

like image 172
Damir Sudarevic Avatar answered Sep 21 '22 23:09

Damir Sudarevic