In our application, we collect data on automotive engine performance -- essentially source data on engine performance keyed by engine type, the vehicle running it, and the engine design. Currently, the basis for new row inserts is an engine on/off period: we record performance variables whenever the engine state changes from active to inactive or vice versa. The related engineState table looks like this:
+---------+-----------+---------------+---------------------+---------------------+-----------------+
| vehicle | engine | engine_state | state_start_time | state_end_time | engine_variable |
+---------+-----------+---------------+---------------------+---------------------+-----------------+
| 080025 | E01 | active | 2008-01-24 16:19:15 | 2008-01-24 16:24:45 | 720 |
| 080028 | E02 | inactive | 2008-01-24 16:19:25 | 2008-01-24 16:22:17 | 304 |
+---------+-----------+---------------+---------------------+---------------------+-----------------+
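A simplified sketch of the definition -- the column types shown here are illustrative, not our exact schema:

CREATE TABLE engineState (
    vehicle          CHAR(6)     NOT NULL,  -- vehicle identifier, e.g. '080025'
    engine           CHAR(3)     NOT NULL,  -- engine identifier, e.g. 'E01'
    engine_state     VARCHAR(8)  NOT NULL,  -- 'active' or 'inactive'
    state_start_time DATETIME    NOT NULL,
    state_end_time   DATETIME    NOT NULL,
    engine_variable  INT         NOT NULL
);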
For a specific analysis, we would like to analyze table content at a row granularity of minutes, rather than the current basis of active/inactive engine state. For this, we are thinking of creating a simple productionMinute table with a row for each minute in the period we are analyzing, and joining the productionMinute and engineState tables on the date-time columns in each table. So if our period of analysis runs from 2009-12-01 to 2010-02-28, we would create a new table with 129,600 rows, one for each minute of each day in that three-month period. The first few rows of the productionMinute table:
+---------------------+
| production_minute |
+---------------------+
| 2009-12-01 00:00 |
| 2009-12-01 00:01 |
| 2009-12-01 00:02 |
| 2009-12-01 00:03 |
+---------------------+
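As an aside, the productionMinute table does not have to be populated by hand. A minimal sketch of generating it in MySQL, assuming a small helper table of digits (the digits table and all names here are ours, not part of the schema):

CREATE TABLE digits (d INT NOT NULL PRIMARY KEY);
INSERT INTO digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

CREATE TABLE productionMinute (
    production_minute DATETIME NOT NULL PRIMARY KEY
);

-- 90 days x 1,440 minutes/day = 129,600 rows
INSERT INTO productionMinute (production_minute)
SELECT '2009-12-01 00:00:00'
       + INTERVAL (a.d + 10*b.d + 100*c.d + 1000*e.d + 10000*f.d + 100000*g.d) MINUTE
FROM digits a, digits b, digits c, digits e, digits f, digits g
WHERE a.d + 10*b.d + 100*c.d + 1000*e.d + 10000*f.d + 100000*g.d < 129600;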
The join between the tables would be:

SELECT pm.production_minute, es.*
FROM engineState AS es
LEFT JOIN productionMinute AS pm ON pm.production_minute >= es.state_start_time
                                AND pm.production_minute <= es.state_end_time
This join, however, raises several issues of scale and cardinality:

- the engineState table has 5 million rows while the productionMinute table has 130,000 rows
- when an engineState row spans more than one minute (i.e. the difference between es.state_start_time and es.state_end_time is greater than one minute), as in the example above, multiple productionMinute rows join to a single engineState row
- when several engine state changes fall within the same minute, multiple engineState rows join to a single productionMinute row

In testing our logic with only a small table extract (one day rather than three months, for the productionMinute table), the query takes over an hour to run. To improve performance so that querying three months of data becomes feasible, our thought was to create a temporary table from the engineState table, eliminating any data that is not critical for the analysis, and to join that temporary table to the productionMinute table (see the sketch below). We are also planning to experiment with different joins -- specifically an inner join -- to see whether that improves performance.
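Concretely, the temporary-table idea might look like the following sketch; the table name engineStateTrimmed, the index, and the date bounds are illustrative:

CREATE TEMPORARY TABLE engineStateTrimmed AS
SELECT vehicle, engine, engine_state, state_start_time, state_end_time, engine_variable
FROM engineState
WHERE state_end_time   >= '2009-12-01 00:00:00'   -- keep only rows overlapping
  AND state_start_time <  '2010-03-01 00:00:00';  -- the analysis window

ALTER TABLE engineStateTrimmed
    ADD INDEX idx_state_times (state_start_time, state_end_time);

SELECT pm.production_minute,
       est.vehicle, est.engine, est.engine_state, est.engine_variable
FROM productionMinute AS pm
INNER JOIN engineStateTrimmed AS est
        ON pm.production_minute BETWEEN est.state_start_time AND est.state_end_time;

Note that with a range join like this the optimizer can generally use only the leading column of the index, so the biggest win usually comes from shrinking the row count before joining.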
What is the best query design for joining tables with the many-to-many relationship between rows outlined above? And what is the best join type (left/right outer, or inner)?
Expressing this as a plain relational join is the most common way of writing such a query and also the most natural. The code is clean and easy to troubleshoot, and, with appropriate indexes, it is usually also an efficient way to join two tables.
The two most important decisions you make with any database are designing how relationships between application entities are mapped to tables (the database schema) and designing how applications get the data they need in the format they need it (queries).
I agree with vy32. You need to do this query once and only once to get your data into a format suitable for analysis. You should use a proper ETL tool (or heck, just perl or something simple) to get the data out of the engineState table, calculate the production minute, then load it into another DB that is properly modeled for analysis-type queries.
If you think your problem through, you'll see you are just denormalizing your data and assigning minute numbers as surrogate keys. This is a relatively easy (and common) ETL problem which isn't performant in straight SQL but is simple with other languages and tools.
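To make the minute-number idea concrete, here is one way to compute those surrogate keys in MySQL before the ETL step fans each row out; the epoch '2009-12-01 00:00:00' and the column aliases are illustrative:

SELECT es.vehicle,
       es.engine,
       es.engine_variable,
       -- minutes elapsed since the start of the analysis window
       TIMESTAMPDIFF(MINUTE, '2009-12-01 00:00:00', es.state_start_time) AS start_minute,
       TIMESTAMPDIFF(MINUTE, '2009-12-01 00:00:00', es.state_end_time)   AS end_minute
FROM engineState AS es;

The ETL job then emits one output row per minute number from start_minute through end_minute, which is exactly the denormalized, minute-grained table the analysis wants.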
Your production volume would be easily handled by a true ETL process.