Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL query with dependent subquery takes too long

I need an SQL guru to help me speed up my query.

I have 2 tables, quantities and prices. quantities records a quantity value between 2 timestamps, 15 minutes apart. prices records a price for a given timestamp, for a given price type and there is a price 5 record for every 5 minutes.

I need 2 work out the total price for each period, e.g. hour or day, between two timestamps. This is calculated by the sum of the (quantity multiplied by the average of the 3 prices in the 15 minute quantity window) in each period.

For example, let's say I want to see the total price each hour for 1 day. The total price value in each row in the result set is the sum of the total prices for each of the four 15 minute periods in that hour. And the total price for each 15 minute period is calculated by multiplying the quantity value in that period by the average of the 3 prices (one for each 5 minutes) in that quantity's period.

Here's the query I'm using, and the results:

SELECT
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
SUM( `quantities`.`quantity` * (
  SELECT AVG( `prices`.`price` )
  FROM `prices`
  WHERE `prices`.`timestamp` >= `quantities`.`start_timestamp`
  AND `prices`.`timestamp` < `quantities`.`end_timestamp`
  AND `prices`.`type_id` = 1
) ) AS total
FROM `quantities`
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR(  `quantities`.`start_timestamp` );

+---------------------+---------------------+----------+
| start               | end                 | total    |
+---------------------+---------------------+----------+
| 2010-07-01 00:00:00 | 2010-07-01 01:00:00 | 0.677733 |
| 2010-07-01 01:00:00 | 2010-07-01 02:00:00 | 0.749133 |
| 2010-07-01 02:00:00 | 2010-07-01 03:00:00 | 0.835467 |
| 2010-07-01 03:00:00 | 2010-07-01 04:00:00 | 0.692233 |
| 2010-07-01 04:00:00 | 2010-07-01 05:00:00 | 0.389533 |
| 2010-07-01 05:00:00 | 2010-07-01 06:00:00 | 0.335300 |
| 2010-07-01 06:00:00 | 2010-07-01 07:00:00 | 1.231467 |
| 2010-07-01 07:00:00 | 2010-07-01 08:00:00 | 0.352800 |
| 2010-07-01 08:00:00 | 2010-07-01 09:00:00 | 1.447200 |
| 2010-07-01 09:00:00 | 2010-07-01 10:00:00 | 0.756733 |
| 2010-07-01 10:00:00 | 2010-07-01 11:00:00 | 0.599467 |
| 2010-07-01 11:00:00 | 2010-07-01 12:00:00 | 1.056467 |
| 2010-07-01 12:00:00 | 2010-07-01 13:00:00 | 1.252600 |
| 2010-07-01 13:00:00 | 2010-07-01 14:00:00 | 1.285567 |
| 2010-07-01 14:00:00 | 2010-07-01 15:00:00 | 0.442933 |
| 2010-07-01 15:00:00 | 2010-07-01 16:00:00 | 0.692567 |
| 2010-07-01 16:00:00 | 2010-07-01 17:00:00 | 1.281067 |
| 2010-07-01 17:00:00 | 2010-07-01 18:00:00 | 0.652033 |
| 2010-07-01 18:00:00 | 2010-07-01 19:00:00 | 1.721900 |
| 2010-07-01 19:00:00 | 2010-07-01 20:00:00 | 1.362400 |
| 2010-07-01 20:00:00 | 2010-07-01 21:00:00 | 1.099300 |
| 2010-07-01 21:00:00 | 2010-07-01 22:00:00 | 0.646267 |
| 2010-07-01 22:00:00 | 2010-07-01 23:00:00 | 0.873100 |
| 2010-07-01 23:00:00 | 2010-07-02 00:00:00 | 0.546533 |
+---------------------+---------------------+----------+
24 rows in set (5.16 sec)

I need the query to run a lot faster than this, and would have though it would be possible. Here's the results from EXPLAIN EXTENDED ...

+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
| id | select_type        | table      | type  | possible_keys     | key             | key_len | ref   | rows  | Extra                                        |
+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
|  1 | PRIMARY            | quantities | range | start_timestamp   | start_timestamp | 8       | NULL  |    89 | Using where; Using temporary; Using filesort |
|  2 | DEPENDENT SUBQUERY | prices     | ref   | timestamp,type_id | type_id         | 4       | const | 22930 | Using where                                  |
+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
2 rows in set, 3 warnings (0.00 sec)

I noticed the dependent sub query is not using the timestamp field in the key and the query is scanning loads of rows.

Can anyone help me get this running a hell of a lot faster?

Here are the SQL statements required to create the schema and fill it with a lot of data (2 months worth)

# Create prices table

CREATE TABLE `prices` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `timestamp` datetime NOT NULL,
  `type_id` int(11) NOT NULL,
  `price` float(8,2) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `timestamp` (`timestamp`),
  KEY `type_id` (`type_id`)
) ENGINE=MyISAM;

# Create quantities table

CREATE TABLE `quantities` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `start_timestamp` datetime NOT NULL,
  `end_timestamp` datetime NOT NULL,
  `quantity` float(7,2) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `start_timestamp` (`start_timestamp`),
  KEY `end_timestamp` (`end_timestamp`)
) ENGINE=MyISAM;

# Insert first 2 rows into prices, one for each of 2 types, starting 64 days ago

INSERT INTO `prices` (`id`, `timestamp`, `type_id`, `price`) VALUES
(NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), '1', RAND()),
(NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), '2', RAND());

# Fill the prices table with a record for each type, for every 5 minutes, for the next 64 days

INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 32 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 16 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 8 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 4 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 2 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 1 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 12 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 6 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 3 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 90 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 45 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 20 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 10 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 5 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_SUB(`timestamp`, INTERVAL 5 MINUTE), `type_id`, RAND() FROM prices WHERE MOD( (TIME_TO_SEC( `timestamp`) - TIME_TO_SEC(CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00')) ), 45 *60 ) = 0 AND `timestamp` > CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00');

# Insert first row into quantities, start timestamp is 64 days ago, end timestamp is start timestamp plus 15 minutes

INSERT INTO `quantities` (`id`, `start_timestamp`, `end_timestamp`, `quantity`) VALUES (NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), DATE_SUB(CURDATE(), INTERVAL '63 23:45' DAY_MINUTE), RAND());

# Fill the quantities table with a record for each 15 minute period for the next 64 days

INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 32 DAY), DATE_ADD(`end_timestamp`, INTERVAL 32 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 16 DAY), DATE_ADD(`end_timestamp`, INTERVAL 16 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 8 DAY), DATE_ADD(`end_timestamp`, INTERVAL 8 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 4 DAY), DATE_ADD(`end_timestamp`, INTERVAL 4 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 2 DAY), DATE_ADD(`end_timestamp`, INTERVAL 2 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 1 DAY), DATE_ADD(`end_timestamp`, INTERVAL 1 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 12 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 12 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 6 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 6 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 3 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 3 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 90 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 90 MINUTE), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 45 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 45 MINUTE), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 15 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 15 MINUTE), RAND() FROM quantities;
INSERT INTO quantities (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_SUB(`start_timestamp`, INTERVAL 15 MINUTE), DATE_SUB(`end_timestamp`, INTERVAL 15 MINUTE), RAND() FROM quantities WHERE MOD( (TIME_TO_SEC( `start_timestamp`) - TIME_TO_SEC(CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00')) ), 45 * 60 ) = 0 AND `start_timestamp` > CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00');
like image 348
neilcrookes Avatar asked Jul 22 '10 10:07

neilcrookes


People also ask

Is it better to use CTE or subquery?

CTE can be more readable: Another advantage of CTE is CTE are more readable than Subqueries. Since CTE can be reusable, you can write less code using CTE than using subquery. Also, people tend to follow the logic and ideas easier in sequence than in a nested fashion.

Are correlated subqueries slow?

Answer: Correlated subqueries are usually used for EXISTS Booleans, and scalar subqueries (e.g. subqueries in the SELECT clause). Correlated subqueries and slow because the sub-query is executed ONCE for each row returned by the outer query.

Which is faster subquery or correlated subquery?

Speed and Performance A correlated subquery is much slower than a non-correlated subquery because in the former, the inner query executes for each row of the outer query. This means if your table has n rows then whole processing will take the n * n = n^2 time, as compared to 2n times taken by a non-correlated subquery.


2 Answers

Here is my first attempt. This one is dirty and uses the following properties on data:

  • there are three 5 minute prices for each quarter in quantities (if this is violated in data the query will not work)
  • notice for each and cardinality of three, this is not guaranteed by data integrity checks so therefore I call it dirty
  • it is also not flexible to changes in periods

Query 1:

SELECT sql_no_cache
    min(q.start_timestamp) as start,  
    max(q.end_timestamp) as end, 
    sum((p1.price + p2.price + p3.price)/3*q.quantity) as total 
FROM 
    quantities q join 
    prices p1 on q.start_timestamp = p1.timestamp and p1.type_id = 1 join 
    prices p2 on p2.timestamp = adddate(q.start_timestamp, interval 5 minute) and p2.type_id = 1 join 
    prices p3 on p3.timestamp = adddate(q.start_timestamp, interval 10 minute) and p3.type_id = 1 
WHERE 
    q.start_timestamp between '2010-07-01 00:00:00' and '2010-07-01 23:59:59' 
GROUP BY hour(q.start_timestamp);

This one returns results in 0.01 sec on my slow testing machine, where original query runs in ~6 sec, and gnarf's query in ~0.85 sec (all queries always tested with SQL_NO_CACHE keyword which does not reuse the results, but on a warm database).

EDIT: Here is a version that is not sensitive to missing rows on the price side Query 1a

SELECT sql_no_cache
    min(q.start_timestamp) as start,  
    max(q.end_timestamp) as end, 
    sum( ( COALESCE(p1.price,0) + COALESCE(p2.price,0) + COALESCE(p3.price,0) ) / ( 
         3 -
         COALESCE(p1.price-p1.price,1) - 
         COALESCE(p2.price-p2.price,1) - 
         COALESCE(p3.price-p3.price,1)
        )
       *q.quantity) as total 
FROM 
    quantities q LEFT JOIN 
    prices p1 on q.start_timestamp = p1.timestamp and p1.type_id = 1 LEFT JOIN
    prices p2 on p2.timestamp = adddate(q.start_timestamp, interval 5 minute) and p2.type_id = 1 LEFT JOIN
    prices p3 on p3.timestamp = adddate(q.start_timestamp, interval 10 minute) and p3.type_id = 1 
WHERE 
    q.start_timestamp between '2010-07-01 00:00:00' and '2010-07-01 23:59:59' 
GROUP BY hour(q.start_timestamp);

EDIT2: Query 2: Here is a direct improvement, and different approach, to your query with minimal changes that brings the execuction time to ~0.22 sec on my machine

SELECT sql_no_cache
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
SUM( `quantities`.`quantity` * (
  SELECT AVG( `prices`.`price` )
  FROM `prices`
  WHERE 
    `prices`.`timestamp` >= '2010-07-01 00:00:00' 
    AND `prices`.`timestamp` < '2010-07-02 00:00:00' 
    AND `prices`.`timestamp` >= `quantities`.`start_timestamp`
    AND `prices`.`timestamp` < `quantities`.`end_timestamp`
    AND `prices`.`type_id` = 1
) ) AS total
FROM `quantities`
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR(  `quantities`.`start_timestamp` );

That is mysql 5.1, I think I have read that in 5.5 this kind of thing (merging indexes) will be available to the query planner. Also, if you could make your start_timestamp and timestamp be related through foreign key that should allow these kind of correlated queries to make use of indexes (but for this you would need to modify design and establish some sort of timeline table which could then be referenced by quantities and prices both).

Query 3: Finally, the last version which does it in ~0.03 sec, but should be as robust and flexible as Query 2

SELECT sql_no_cache
MIN(start),
MAX(end),
SUM(subtotal)
FROM 
(
SELECT sql_no_cache
q.`start_timestamp` AS `start`,
q.`end_timestamp` AS `end`,
AVG(p.`price` * q.`quantity`) AS `subtotal`
FROM `quantities` q
LEFT JOIN `prices` p ON p.timestamp >= q.start_timestamp AND 
                        p.timestamp < q.end_timestamp AND
                        p.timestamp >= '2010-07-01 00:00:00' AND 
                        p.`timestamp` < '2010-07-02 00:00:00' 
WHERE q.`start_timestamp` >= '2010-07-01 00:00:00' 
AND q.`start_timestamp` < '2010-07-02 00:00:00'
AND p.type_id = 1
GROUP BY q.`start_timestamp`
) forced_tmp
GROUP BY hour( start );

NOTE: Do not forget to remove sql_no_cache keywords in production.

There are many counter intuitive tricks applied in the above queries (sometimes conditions repeated in the join condition speed up queries, sometimes they slow them down). Mysql is great little RDBMS and really fast when it comes to relatively simple queries, but when the complexity increases it is easy to run into the above scenarios.

So in general, I apply the following principle to set my expectations regarding the performance of a query:

  • if the base result set has < 1,000 rows then query should do its business in ~0.01 sec (base result set is the number of rows that functionally determine resulting set)

In this particular case you start with less then 1000 rows (all the prices and quantities in one day, with 15 minutes precision) and from that you should be able to compute the final results.

like image 101
Unreason Avatar answered Oct 05 '22 23:10

Unreason


This should return the same results and perform slightly faster:

SELECT
  MIN( `quantities`.`start_timestamp` ) AS `start`,
  MAX( `quantities`.`end_timestamp` ) AS `end`,
  SUM( `quantities`.`quantity` * `prices`.`price` ) 
   * COUNT(DISTINCT `quantities`.`id`) 
   / COUNT(DISTINCT `prices`.`id`)
    AS total
FROM `quantities`
JOIN `prices` ON `prices`.`timestamp` >= `quantities`.`start_timestamp`
  AND `prices`.`timestamp` < `quantities`.`end_timestamp`
  AND `prices`.`type_id` = 1
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
  AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR(  `quantities`.`start_timestamp` );

Since you can't calculate AVG() inside the SUM(), I had to do some interesting COUNT(DISTINCT) to calculate the number of prices returned per quantities. I'm wondering if this gives you the same results with "real" data...

Using JOIN:

+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+
| id | select_type | table      | type  | possible_keys                 | key             | key_len | ref  | rows  | filtered | Extra                                        |
+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | quantities | range | start_timestamp,end_timestamp | start_timestamp | 8       | NULL |    89 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | prices     | ALL   | timestamp,type_id             | NULL            | NULL    | NULL | 36862 |    62.20 | Using where; Using join buffer               |
+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+

vs. the same query only adding LEFT to the JOIN

+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+
| id | select_type | table      | type  | possible_keys     | key             | key_len | ref   | rows  | filtered | Extra                                        |
+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | quantities | range | start_timestamp   | start_timestamp | 8       | NULL  |    89 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | prices     | ref   | timestamp,type_id | type_id         | 4       | const | 22930 |   100.00 |                                              |
+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+

Interesting that LEFT can completely removes the end_timestamp as a possible key, and changes the selected keys so much, making it take 15 times as long...

This reference page could help you out a little more if you want to look at specifying index hints for your JOINS

like image 34
gnarf Avatar answered Oct 06 '22 00:10

gnarf