
In MySQL, what is the most effective query design for joining large tables with many-to-many relationships between the join predicates?

In our application, we collect data on automotive engine performance -- source data keyed by engine type, the vehicle running it, and the engine design. Currently, the basis for new row inserts is an engine on/off period: we record performance variables whenever the engine state changes from active to inactive or vice versa. The related engineState table looks like this:

+---------+-----------+---------------+---------------------+---------------------+-----------------+
| vehicle | engine    | engine_state  | state_start_time    | state_end_time      | engine_variable |
+---------+-----------+---------------+---------------------+---------------------+-----------------+
| 080025  | E01       | active        | 2008-01-24 16:19:15 | 2008-01-24 16:24:45 |             720 | 
| 080028  | E02       | inactive      | 2008-01-24 16:19:25 | 2008-01-24 16:22:17 |             304 |
+---------+-----------+---------------+---------------------+---------------------+-----------------+ 
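For context, a minimal schema sketch for this table (the original DDL isn't shown, so the column types and the index are assumptions based on the sample rows):

```sql
-- Hypothetical DDL; types inferred from the sample data above.
CREATE TABLE engineState (
    vehicle          CHAR(6)   NOT NULL,
    engine           CHAR(3)   NOT NULL,
    engine_state     ENUM('active', 'inactive') NOT NULL,
    state_start_time DATETIME  NOT NULL,
    state_end_time   DATETIME  NOT NULL,
    engine_variable  INT       NOT NULL,
    KEY idx_state_times (state_start_time, state_end_time)
);
```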

For a specific analysis, we would like to analyze table content at a row granularity of minutes, rather than the current basis of active/inactive engine state. For this, we are thinking of creating a simple productionMinute table with a row for each minute in the period we are analyzing, and joining the productionMinute and engineState tables on the date-time columns in each table. So if our period of analysis is from 2009-12-01 to 2010-02-28, we would create a new table with 129,600 rows, one for each minute of each day in that three-month period. The first few rows of the productionMinute table:

+---------------------+ 
| production_minute   |
+---------------------+
| 2009-12-01 00:00    |
| 2009-12-01 00:01    |
| 2009-12-01 00:02    |     
| 2009-12-01 00:03    |
+---------------------+
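On MySQL 8.0+ this minute table can be generated with a recursive CTE rather than built by hand (a sketch; on older versions a numbers table built by cross-joining a digits table achieves the same thing):

```sql
-- Assumes MySQL 8.0+. The default recursion limit (1000) is far below
-- 129,600 rows, so raise it for this session first.
SET SESSION cte_max_recursion_depth = 200000;

CREATE TABLE productionMinute (production_minute DATETIME PRIMARY KEY);

INSERT INTO productionMinute (production_minute)
WITH RECURSIVE minutes (m) AS (
    SELECT TIMESTAMP '2009-12-01 00:00:00'
    UNION ALL
    SELECT m + INTERVAL 1 MINUTE FROM minutes
     WHERE m < TIMESTAMP '2010-02-28 23:59:00'
)
SELECT m FROM minutes;
```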

The join between the tables would be:

     FROM engineState AS es 
LEFT JOIN productionMinute AS pm ON pm.production_minute >= es.state_start_time 
                                AND pm.production_minute <= es.state_end_time 
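For reference, the same range predicate can be written with BETWEEN, driven from the smaller table, and supported by an index on the start time (a sketch; the index name is a placeholder, and interval joins like this remain inherently expensive because the index can only bound one side of the range):

```sql
-- Sketch: drive from the small productionMinute table and let an
-- index on state_start_time narrow each range probe.
CREATE INDEX idx_es_start ON engineState (state_start_time);

SELECT pm.production_minute,
       es.vehicle, es.engine, es.engine_variable
  FROM productionMinute AS pm
  JOIN engineState      AS es
    ON pm.production_minute BETWEEN es.state_start_time
                                AND es.state_end_time;
```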

This join, however, raises several issues:

  1. The engineState table has 5 million rows and the productionMinute table has 130,000 rows
  2. When an engineState row spans more than one minute (i.e. the difference between es.state_start_time and es.state_end_time is greater than one minute), as is the case in the example above, there are multiple productionMinute table rows that join to a single engineState table row
  3. When there is more than one engine in operation during any given minute, also as per the example above, multiple engineState table rows join to a single productionMinute row

In testing our logic with only a small table extract (one day rather than three months for the productionMinute table), the query takes over an hour to run. To make querying three months of data feasible, our thought was to create a temporary table from the engineState table, eliminating any data that is not critical for the analysis, and to join that temporary table to the productionMinute table. We are also planning to experiment with different joins -- specifically an inner join -- to see whether that improves performance.
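The temporary-table idea sketched in SQL (the table and index names are placeholders):

```sql
-- Keep only the columns and the three-month window needed for the
-- analysis, then index the join column on the slimmed-down copy.
CREATE TEMPORARY TABLE engineStateSlim AS
SELECT vehicle, engine, state_start_time, state_end_time, engine_variable
  FROM engineState
 WHERE state_end_time   >= '2009-12-01 00:00:00'
   AND state_start_time <  '2010-03-01 00:00:00';

ALTER TABLE engineStateSlim ADD INDEX idx_slim_start (state_start_time);
```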

What is the best query design for joining tables with a many-to-many relationship between the join predicates, as outlined above? And what is the best join type (left/right, inner)?

lighthouse65 asked Mar 13 '10

1 Answer

I agree with vy32. You need to do this query once and only once to get your data into a format suitable for analysis. You should use a proper ETL tool (or heck, just perl or something simple) to get the data out of the engineState table, calculate the production minute, and then load it into another DB that is properly modeled for analysis-type queries.

If you think your problem through, you'll see you're just denormalizing your data and assigning minute numbers as surrogate keys. This is a relatively easy (and common) ETL problem that isn't performant in straight SQL but is simple with other languages and tools.
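A minimal sketch of that expansion step in Python (the function and column names are illustrative; any scripting language works the same way):

```python
from datetime import datetime, timedelta

def expand_to_minutes(vehicle, engine, start, end, variable):
    """Yield one (minute, vehicle, engine, variable) tuple per whole
    minute covered by an engineState row -- the denormalization step,
    done outside SQL."""
    # Truncate the start time down to its minute boundary.
    minute = start.replace(second=0, microsecond=0)
    while minute <= end:
        yield (minute, vehicle, engine, variable)
        minute += timedelta(minutes=1)

# The first sample row from the question: active 16:19:15 - 16:24:45.
rows = list(expand_to_minutes(
    "080025", "E01",
    datetime(2008, 1, 24, 16, 19, 15),
    datetime(2008, 1, 24, 16, 24, 45),
    720,
))
# 16:19 through 16:24 inclusive -> 6 per-minute rows
```

Each emitted tuple is then a straight insert into the analysis table, keyed by the minute.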

Your production volume would be easily handled by a true ETL process.

bot403 answered Sep 17 '22