
In MySQL, what is the most effective query design for joining large tables with many-to-many relationships between the join predicates?

In our application, we collect data on automotive engine performance -- source data keyed by engine type, the vehicle running it, and the engine design. Currently, the basis for new row inserts is an engine on/off period: we record performance variables whenever the engine state changes from active to inactive or vice versa. The related engineState table looks like this:

+---------+-----------+---------------+---------------------+---------------------+-----------------+
| vehicle | engine    | engine_state  | state_start_time    | state_end_time      | engine_variable |
+---------+-----------+---------------+---------------------+---------------------+-----------------+
| 080025  | E01       | active        | 2008-01-24 16:19:15 | 2008-01-24 16:24:45 |             720 | 
| 080028  | E02       | inactive      | 2008-01-24 16:19:25 | 2008-01-24 16:22:17 |             304 |
+---------+-----------+---------------+---------------------+---------------------+-----------------+ 
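For context, a minimal schema sketch for this table (the original DDL isn't shown, so the column types and the index are assumptions based on the sample rows):

```sql
-- Hypothetical DDL; types inferred from the sample data above.
CREATE TABLE engineState (
    vehicle          CHAR(6)   NOT NULL,
    engine           CHAR(3)   NOT NULL,
    engine_state     ENUM('active', 'inactive') NOT NULL,
    state_start_time DATETIME  NOT NULL,
    state_end_time   DATETIME  NOT NULL,
    engine_variable  INT       NOT NULL,
    KEY idx_state_times (state_start_time, state_end_time)
);
```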

For a specific analysis, we would like to analyze table content at a row granularity of minutes, rather than the current basis of active/inactive engine state. For this, we are thinking of creating a simple productionMinute table with a row for each minute in the period we are analyzing, and joining the productionMinute and engineState tables on the date-time columns in each table. So if our period of analysis is from 2009-12-01 to 2010-02-28, we would create a new table with 129,600 rows, one for each minute of each day in that three-month period. The first few rows of the productionMinute table:

+---------------------+ 
| production_minute   |
+---------------------+
| 2009-12-01 00:00    |
| 2009-12-01 00:01    |
| 2009-12-01 00:02    |     
| 2009-12-01 00:03    |
+---------------------+
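On MySQL 8.0+ this minute table can be generated with a recursive CTE rather than built by hand (a sketch; on older versions a numbers table built by cross-joining a digits table achieves the same thing):

```sql
-- Assumes MySQL 8.0+. The default recursion limit (1000) is far below
-- 129,600 rows, so raise it for this session first.
SET SESSION cte_max_recursion_depth = 200000;

CREATE TABLE productionMinute (production_minute DATETIME PRIMARY KEY);

INSERT INTO productionMinute (production_minute)
WITH RECURSIVE minutes (m) AS (
    SELECT TIMESTAMP '2009-12-01 00:00:00'
    UNION ALL
    SELECT m + INTERVAL 1 MINUTE FROM minutes
     WHERE m < TIMESTAMP '2010-02-28 23:59:00'
)
SELECT m FROM minutes;
```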

The join between the tables would be:

     FROM engineState AS es 
LEFT JOIN productionMinute AS pm ON pm.production_minute >= es.state_start_time 
                                AND pm.production_minute <= es.state_end_time 
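For reference, the same range predicate can be written with BETWEEN, driven from the smaller table, and supported by an index on the start time (a sketch; the index name is a placeholder, and interval joins like this remain inherently expensive because the index can only bound one side of the range):

```sql
-- Sketch: drive from the small productionMinute table and let an
-- index on state_start_time narrow each range probe.
CREATE INDEX idx_es_start ON engineState (state_start_time);

SELECT pm.production_minute,
       es.vehicle, es.engine, es.engine_variable
  FROM productionMinute AS pm
  JOIN engineState      AS es
    ON pm.production_minute BETWEEN es.state_start_time
                                AND es.state_end_time;
```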

This join, however, raises several issues:

  1. The engineState table has 5 million rows and the productionMinute table has 130,000 rows
  2. When an engineState row spans more than one minute (i.e. the difference between es.state_start_time and es.state_end_time is greater than one minute), as is the case in the example above, there are multiple productionMinute table rows that join to a single engineState table row
  3. When there is more than one engine in operation during any given minute, also as per the example above, multiple engineState table rows join to a single productionMinute row

In testing our logic with only a small table extract (one day rather than three months for the productionMinute table), the query takes over an hour to run. To make querying three months of data feasible, our thought was to create a temporary table from the engineState table, eliminating any data that is not critical for the analysis, and to join that temporary table to the productionMinute table. We are also planning to experiment with different joins -- specifically an inner join -- to see whether that improves performance.
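The temporary-table idea sketched in SQL (the table and index names are placeholders):

```sql
-- Keep only the columns and the three-month window needed for the
-- analysis, then index the join column on the slimmed-down copy.
CREATE TEMPORARY TABLE engineStateSlim AS
SELECT vehicle, engine, state_start_time, state_end_time, engine_variable
  FROM engineState
 WHERE state_end_time   >= '2009-12-01 00:00:00'
   AND state_start_time <  '2010-03-01 00:00:00';

ALTER TABLE engineStateSlim ADD INDEX idx_slim_start (state_start_time);
```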

What is the best query design for joining tables with a many-to-many relationship between the join predicates, as outlined above? And what is the best join type (left/right, inner)?

lighthouse65 asked Mar 13 '10

1 Answer

I agree with vy32. You need to do this query once and only once to get your data into a format suitable for analysis. You should use a proper ETL tool (or heck, just perl or something simple) to get the data out of the engineState table, calculate the production minute, and then load it into another DB that is properly modeled for analysis-type queries.

If you think your problem through, you'll see you're just denormalizing your data and assigning minute numbers as surrogate keys. This is a relatively easy (and common) ETL problem that isn't performant in straight SQL but is simple with other languages and tools.
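A minimal sketch of that expansion step in Python (the function and column names are illustrative; any scripting language works the same way):

```python
from datetime import datetime, timedelta

def expand_to_minutes(vehicle, engine, start, end, variable):
    """Yield one (minute, vehicle, engine, variable) tuple per whole
    minute covered by an engineState row -- the denormalization step,
    done outside SQL."""
    # Truncate the start time down to its minute boundary.
    minute = start.replace(second=0, microsecond=0)
    while minute <= end:
        yield (minute, vehicle, engine, variable)
        minute += timedelta(minutes=1)

# The first sample row from the question: active 16:19:15 - 16:24:45.
rows = list(expand_to_minutes(
    "080025", "E01",
    datetime(2008, 1, 24, 16, 19, 15),
    datetime(2008, 1, 24, 16, 24, 45),
    720,
))
# 16:19 through 16:24 inclusive -> 6 per-minute rows
```

Each emitted tuple is then a straight insert into the analysis table, keyed by the minute.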

Your production volume would be easily handled by a true ETL process.

bot403 answered Sep 17 '22