I need an SQL guru to help me speed up my query.
I have 2 tables, quantities and prices. quantities records a quantity value between 2 timestamps, 15 minutes apart. prices records a price for a given timestamp, for a given price type and there is a price 5 record for every 5 minutes.
I need 2 work out the total price for each period, e.g. hour or day, between two timestamps. This is calculated by the sum of the (quantity multiplied by the average of the 3 prices in the 15 minute quantity window) in each period.
For example, let's say I want to see the total price each hour for 1 day. The total price value in each row in the result set is the sum of the total prices for each of the four 15 minute periods in that hour. And the total price for each 15 minute period is calculated by multiplying the quantity value in that period by the average of the 3 prices (one for each 5 minutes) in that quantity's period.
Here's the query I'm using, and the results:
SELECT
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
SUM( `quantities`.`quantity` * (
SELECT AVG( `prices`.`price` )
FROM `prices`
WHERE `prices`.`timestamp` >= `quantities`.`start_timestamp`
AND `prices`.`timestamp` < `quantities`.`end_timestamp`
AND `prices`.`type_id` = 1
) ) AS total
FROM `quantities`
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR( `quantities`.`start_timestamp` );
+---------------------+---------------------+----------+
| start | end | total |
+---------------------+---------------------+----------+
| 2010-07-01 00:00:00 | 2010-07-01 01:00:00 | 0.677733 |
| 2010-07-01 01:00:00 | 2010-07-01 02:00:00 | 0.749133 |
| 2010-07-01 02:00:00 | 2010-07-01 03:00:00 | 0.835467 |
| 2010-07-01 03:00:00 | 2010-07-01 04:00:00 | 0.692233 |
| 2010-07-01 04:00:00 | 2010-07-01 05:00:00 | 0.389533 |
| 2010-07-01 05:00:00 | 2010-07-01 06:00:00 | 0.335300 |
| 2010-07-01 06:00:00 | 2010-07-01 07:00:00 | 1.231467 |
| 2010-07-01 07:00:00 | 2010-07-01 08:00:00 | 0.352800 |
| 2010-07-01 08:00:00 | 2010-07-01 09:00:00 | 1.447200 |
| 2010-07-01 09:00:00 | 2010-07-01 10:00:00 | 0.756733 |
| 2010-07-01 10:00:00 | 2010-07-01 11:00:00 | 0.599467 |
| 2010-07-01 11:00:00 | 2010-07-01 12:00:00 | 1.056467 |
| 2010-07-01 12:00:00 | 2010-07-01 13:00:00 | 1.252600 |
| 2010-07-01 13:00:00 | 2010-07-01 14:00:00 | 1.285567 |
| 2010-07-01 14:00:00 | 2010-07-01 15:00:00 | 0.442933 |
| 2010-07-01 15:00:00 | 2010-07-01 16:00:00 | 0.692567 |
| 2010-07-01 16:00:00 | 2010-07-01 17:00:00 | 1.281067 |
| 2010-07-01 17:00:00 | 2010-07-01 18:00:00 | 0.652033 |
| 2010-07-01 18:00:00 | 2010-07-01 19:00:00 | 1.721900 |
| 2010-07-01 19:00:00 | 2010-07-01 20:00:00 | 1.362400 |
| 2010-07-01 20:00:00 | 2010-07-01 21:00:00 | 1.099300 |
| 2010-07-01 21:00:00 | 2010-07-01 22:00:00 | 0.646267 |
| 2010-07-01 22:00:00 | 2010-07-01 23:00:00 | 0.873100 |
| 2010-07-01 23:00:00 | 2010-07-02 00:00:00 | 0.546533 |
+---------------------+---------------------+----------+
24 rows in set (5.16 sec)
I need the query to run a lot faster than this, and would have though it would be possible. Here's the results from EXPLAIN EXTENDED ...
+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
| 1 | PRIMARY | quantities | range | start_timestamp | start_timestamp | 8 | NULL | 89 | Using where; Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | prices | ref | timestamp,type_id | type_id | 4 | const | 22930 | Using where |
+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
2 rows in set, 3 warnings (0.00 sec)
I noticed the dependent sub query is not using the timestamp field in the key and the query is scanning loads of rows.
Can anyone help me get this running a hell of a lot faster?
Here are the SQL statements required to create the schema and fill it with a lot of data (2 months worth)
# Create prices table
CREATE TABLE `prices` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`timestamp` datetime NOT NULL,
`type_id` int(11) NOT NULL,
`price` float(8,2) NOT NULL,
PRIMARY KEY (`id`),
KEY `timestamp` (`timestamp`),
KEY `type_id` (`type_id`)
) ENGINE=MyISAM;
# Create quantities table
CREATE TABLE `quantities` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start_timestamp` datetime NOT NULL,
`end_timestamp` datetime NOT NULL,
`quantity` float(7,2) NOT NULL,
PRIMARY KEY (`id`),
KEY `start_timestamp` (`start_timestamp`),
KEY `end_timestamp` (`end_timestamp`)
) ENGINE=MyISAM;
# Insert first 2 rows into prices, one for each of 2 types, starting 64 days ago
INSERT INTO `prices` (`id`, `timestamp`, `type_id`, `price`) VALUES
(NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), '1', RAND()),
(NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), '2', RAND());
# Fill the prices table with a record for each type, for every 5 minutes, for the next 64 days
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 32 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 16 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 8 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 4 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 2 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 1 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 12 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 6 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 3 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 90 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 45 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 20 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 10 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 5 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_SUB(`timestamp`, INTERVAL 5 MINUTE), `type_id`, RAND() FROM prices WHERE MOD( (TIME_TO_SEC( `timestamp`) - TIME_TO_SEC(CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00')) ), 45 *60 ) = 0 AND `timestamp` > CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00');
# Insert first row into quantities, start timestamp is 64 days ago, end timestamp is start timestamp plus 15 minutes
INSERT INTO `quantities` (`id`, `start_timestamp`, `end_timestamp`, `quantity`) VALUES (NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), DATE_SUB(CURDATE(), INTERVAL '63 23:45' DAY_MINUTE), RAND());
# Fill the quantities table with a record for each 15 minute period for the next 64 days
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 32 DAY), DATE_ADD(`end_timestamp`, INTERVAL 32 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 16 DAY), DATE_ADD(`end_timestamp`, INTERVAL 16 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 8 DAY), DATE_ADD(`end_timestamp`, INTERVAL 8 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 4 DAY), DATE_ADD(`end_timestamp`, INTERVAL 4 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 2 DAY), DATE_ADD(`end_timestamp`, INTERVAL 2 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 1 DAY), DATE_ADD(`end_timestamp`, INTERVAL 1 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 12 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 12 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 6 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 6 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 3 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 3 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 90 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 90 MINUTE), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 45 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 45 MINUTE), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 15 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 15 MINUTE), RAND() FROM quantities;
INSERT INTO quantities (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_SUB(`start_timestamp`, INTERVAL 15 MINUTE), DATE_SUB(`end_timestamp`, INTERVAL 15 MINUTE), RAND() FROM quantities WHERE MOD( (TIME_TO_SEC( `start_timestamp`) - TIME_TO_SEC(CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00')) ), 45 * 60 ) = 0 AND `start_timestamp` > CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00');
CTE can be more readable: Another advantage of CTE is CTE are more readable than Subqueries. Since CTE can be reusable, you can write less code using CTE than using subquery. Also, people tend to follow the logic and ideas easier in sequence than in a nested fashion.
Answer: Correlated subqueries are usually used for EXISTS Booleans, and scalar subqueries (e.g. subqueries in the SELECT clause). Correlated subqueries and slow because the sub-query is executed ONCE for each row returned by the outer query.
Speed and Performance A correlated subquery is much slower than a non-correlated subquery because in the former, the inner query executes for each row of the outer query. This means if your table has n rows then whole processing will take the n * n = n^2 time, as compared to 2n times taken by a non-correlated subquery.
Here is my first attempt. This one is dirty and uses the following properties on data:
Query 1:
SELECT sql_no_cache
min(q.start_timestamp) as start,
max(q.end_timestamp) as end,
sum((p1.price + p2.price + p3.price)/3*q.quantity) as total
FROM
quantities q join
prices p1 on q.start_timestamp = p1.timestamp and p1.type_id = 1 join
prices p2 on p2.timestamp = adddate(q.start_timestamp, interval 5 minute) and p2.type_id = 1 join
prices p3 on p3.timestamp = adddate(q.start_timestamp, interval 10 minute) and p3.type_id = 1
WHERE
q.start_timestamp between '2010-07-01 00:00:00' and '2010-07-01 23:59:59'
GROUP BY hour(q.start_timestamp);
This one returns results in 0.01 sec on my slow testing machine, where original query runs in ~6 sec, and gnarf's query in ~0.85 sec (all queries always tested with SQL_NO_CACHE
keyword which does not reuse the results, but on a warm database).
EDIT: Here is a version that is not sensitive to missing rows on the price side Query 1a
SELECT sql_no_cache
min(q.start_timestamp) as start,
max(q.end_timestamp) as end,
sum( ( COALESCE(p1.price,0) + COALESCE(p2.price,0) + COALESCE(p3.price,0) ) / (
3 -
COALESCE(p1.price-p1.price,1) -
COALESCE(p2.price-p2.price,1) -
COALESCE(p3.price-p3.price,1)
)
*q.quantity) as total
FROM
quantities q LEFT JOIN
prices p1 on q.start_timestamp = p1.timestamp and p1.type_id = 1 LEFT JOIN
prices p2 on p2.timestamp = adddate(q.start_timestamp, interval 5 minute) and p2.type_id = 1 LEFT JOIN
prices p3 on p3.timestamp = adddate(q.start_timestamp, interval 10 minute) and p3.type_id = 1
WHERE
q.start_timestamp between '2010-07-01 00:00:00' and '2010-07-01 23:59:59'
GROUP BY hour(q.start_timestamp);
EDIT2: Query 2: Here is a direct improvement, and different approach, to your query with minimal changes that brings the execuction time to ~0.22 sec on my machine
SELECT sql_no_cache
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
SUM( `quantities`.`quantity` * (
SELECT AVG( `prices`.`price` )
FROM `prices`
WHERE
`prices`.`timestamp` >= '2010-07-01 00:00:00'
AND `prices`.`timestamp` < '2010-07-02 00:00:00'
AND `prices`.`timestamp` >= `quantities`.`start_timestamp`
AND `prices`.`timestamp` < `quantities`.`end_timestamp`
AND `prices`.`type_id` = 1
) ) AS total
FROM `quantities`
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR( `quantities`.`start_timestamp` );
That is mysql 5.1, I think I have read that in 5.5 this kind of thing (merging indexes) will be available to the query planner. Also, if you could make your start_timestamp and timestamp be related through foreign key that should allow these kind of correlated queries to make use of indexes (but for this you would need to modify design and establish some sort of timeline table which could then be referenced by quantities and prices both).
Query 3: Finally, the last version which does it in ~0.03 sec, but should be as robust and flexible as Query 2
SELECT sql_no_cache
MIN(start),
MAX(end),
SUM(subtotal)
FROM
(
SELECT sql_no_cache
q.`start_timestamp` AS `start`,
q.`end_timestamp` AS `end`,
AVG(p.`price` * q.`quantity`) AS `subtotal`
FROM `quantities` q
LEFT JOIN `prices` p ON p.timestamp >= q.start_timestamp AND
p.timestamp < q.end_timestamp AND
p.timestamp >= '2010-07-01 00:00:00' AND
p.`timestamp` < '2010-07-02 00:00:00'
WHERE q.`start_timestamp` >= '2010-07-01 00:00:00'
AND q.`start_timestamp` < '2010-07-02 00:00:00'
AND p.type_id = 1
GROUP BY q.`start_timestamp`
) forced_tmp
GROUP BY hour( start );
NOTE: Do not forget to remove sql_no_cache keywords in production.
There are many counter intuitive tricks applied in the above queries (sometimes conditions repeated in the join condition speed up queries, sometimes they slow them down). Mysql is great little RDBMS and really fast when it comes to relatively simple queries, but when the complexity increases it is easy to run into the above scenarios.
So in general, I apply the following principle to set my expectations regarding the performance of a query:
In this particular case you start with less then 1000 rows (all the prices and quantities in one day, with 15 minutes precision) and from that you should be able to compute the final results.
This should return the same results and perform slightly faster:
SELECT
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
SUM( `quantities`.`quantity` * `prices`.`price` )
* COUNT(DISTINCT `quantities`.`id`)
/ COUNT(DISTINCT `prices`.`id`)
AS total
FROM `quantities`
JOIN `prices` ON `prices`.`timestamp` >= `quantities`.`start_timestamp`
AND `prices`.`timestamp` < `quantities`.`end_timestamp`
AND `prices`.`type_id` = 1
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR( `quantities`.`start_timestamp` );
Since you can't calculate AVG()
inside the SUM()
, I had to do some interesting COUNT(DISTINCT)
to calculate the number of prices
returned per quantities
. I'm wondering if this gives you the same results with "real" data...
Using JOIN
:
+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+ | 1 | SIMPLE | quantities | range | start_timestamp,end_timestamp | start_timestamp | 8 | NULL | 89 | 100.00 | Using where; Using temporary; Using filesort | | 1 | SIMPLE | prices | ALL | timestamp,type_id | NULL | NULL | NULL | 36862 | 62.20 | Using where; Using join buffer | +----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+
vs. the same query only adding LEFT
to the JOIN
+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+ | 1 | SIMPLE | quantities | range | start_timestamp | start_timestamp | 8 | NULL | 89 | 100.00 | Using where; Using temporary; Using filesort | | 1 | SIMPLE | prices | ref | timestamp,type_id | type_id | 4 | const | 22930 | 100.00 | | +----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+
Interesting that LEFT
can completely removes the end_timestamp
as a possible key, and changes the selected keys so much, making it take 15 times as long...
This reference page could help you out a little more if you want to look at specifying index hints for your JOINS
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With