I'm looking to gather counts by hour. But not every hour is represented in my table.
To make sure the data always includes empty hours, I built an hour table that holds every hour as a datetime from 2000 through 2036. I figure I can LEFT JOIN data tables to this table to keep track of missing hours. But I need help doing so.
Table date_hour:
hour
2000-01-01 00:00:00
2000-01-01 01:00:00
...
2036-12-31 23:00:00
Table my_data:
log_date field1
2015-05-01 00:31:00 1000
2015-05-01 04:19:00 2000
2015-05-01 05:19:00 1000
2015-05-01 07:19:00 100
2015-05-01 07:35:00 6000
Desired result:
hour count
2015-05-01 00:00:00 1
2015-05-01 01:00:00 0
2015-05-01 02:00:00 0
2015-05-01 03:00:00 0
2015-05-01 04:00:00 1
2015-05-01 05:00:00 1
2015-05-01 06:00:00 0
2015-05-01 07:00:00 2
MySQL attempt:
SELECT
dh.hour,
COUNT(md.*) AS count
FROM
date_hour dh
LEFT JOIN my_data md ON dh.hour = ????md.log_date????
WHERE
dh.hour >= '2015-05-01'
AND dh.hour < '2015-05-02'
GROUP BY
dh.hour
ORDER BY
dh.hour;
What's the most efficient way to accomplish these counts? Assume each day has 100k-1MM records, with the goal of measuring at least 30 days of data at a time.
You can use DATE_FORMAT to strip off the minutes and seconds, like this:
Query:
SELECT
dh.hour,
COUNT(md.log_date) AS count
FROM
date_hour dh LEFT JOIN my_data md
ON dh.hour = DATE_FORMAT(md.log_date, "%Y-%m-%d %H:00:00")
WHERE
dh.hour >= '2015-05-01'
AND dh.hour < '2015-05-02'
GROUP BY
dh.hour
ORDER BY
dh.hour
;
output
+------------------------+-----------+
| hour | count |
+------------------------+-----------+
| 2015-05-01 00:00:00 | 1 |
| 2015-05-01 01:00:00 | 0 |
| 2015-05-01 02:00:00 | 0 |
| 2015-05-01 03:00:00 | 0 |
| 2015-05-01 04:00:00 | 1 |
| 2015-05-01 05:00:00 | 1 |
| 2015-05-01 06:00:00 | 0 |
| 2015-05-01 07:00:00 | 2 |
| ... trailing hours ... | allzeroes |
+------------------------+-----------+
Every hour from 2015-05-01 08:00:00 onward has a count of zero (there is no data for those hours in my_data).
sqlfiddle
If you LEFT JOIN on the result of a function such as DATE_FORMAT, the query produces correct results, but it may be much slower than necessary. If the performance of the simple approach shown in the answer by @amdixon is acceptable, just use it.
However, there are a few things you can do to make it faster. You may want to consider them once your table grows to 30M rows (30 days, 1M rows per day).
It goes without saying that the date_hour table must have an index (actually, a primary key) on the hour column. This helps the server quickly find the few rows for a particular day when you use a search condition like this:
WHERE
date_hour.hour >= '2015-05-01 00:00:00'
AND date_hour.hour < '2015-05-02 00:00:00'
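A minimal sketch of such a table definition follows. The column type and the backticks around hour are assumptions; the original post only shows the column name and sample values.

```sql
-- Hypothetical definition of the hour table: one row per hour,
-- with the hour itself as the primary key.
CREATE TABLE date_hour (
    `hour` DATETIME NOT NULL,
    PRIMARY KEY (`hour`)
);

-- A range condition on the key lets the server read only the
-- 24 rows of the requested day.
SELECT `hour`
FROM date_hour
WHERE `hour` >= '2015-05-01 00:00:00'
  AND `hour` <  '2015-05-02 00:00:00';
```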
Another important thing to keep in mind - if you have 1M rows for a given day and you need to calculate counts for that day, then the server has to read at least these 1M rows. You can't avoid that. Reading 1M rows would not be fast, but if the whole table is 30M rows, then it is obviously better to read just 1M rows than the whole table.
So, the server should be able to find rows of the particular day efficiently (read - there should be an index).
Any query that removes minutes and seconds from the log_date column on the fly while joining cannot use an index on that column for the join, so the server has to scan the whole my_data table.
Option 1
Add an index on my_data.log_date and add an explicit range filter on my_data.log_date to the query. The filter must go into the ON clause of the LEFT JOIN, not the WHERE clause: a WHERE condition on my_data would discard the rows for empty hours, where my_data.log_date is NULL, and effectively turn the LEFT JOIN into an inner join. Placed in the ON clause, the filter does not change the results, but hopefully gives the server a good hint to use the index on my_data.log_date to find the necessary rows and avoid a full scan.
Maybe MySQL is smart enough, when you convert a datetime to a string using DATE_FORMAT, not to convert date_hour.hour to a string as well for the comparison (which would negate the index on date_hour.hour). Maybe not. I prefer the following method, which removes minutes and seconds from the datetime without converting it to a string:
TIMESTAMPADD(HOUR,
TIMESTAMPDIFF(HOUR,'2015-01-01 00:00:00',DateTimeValue),
'2015-01-01 00:00:00')
We can use any constant instead of '2015-01-01 00:00:00' as long as it has no minutes and seconds. The same method can truncate a datetime to any other boundary (minute, day, week, month, year), provided the constant itself lies on that boundary.
SELECT
date_hour.hour,
COUNT(my_data.log_date) AS count
FROM
date_hour
LEFT JOIN my_data ON
date_hour.hour = TIMESTAMPADD(HOUR, TIMESTAMPDIFF(HOUR,'2015-01-01 00:00:00',my_data.log_date), '2015-01-01 00:00:00') AND
my_data.log_date >= '2015-05-01 00:00:00' AND
my_data.log_date < '2015-05-02 00:00:00'
WHERE
date_hour.hour >= '2015-05-01 00:00:00' AND
date_hour.hour < '2015-05-02 00:00:00'
GROUP BY
date_hour.hour
ORDER BY
date_hour.hour
;
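To see what the TIMESTAMPADD/TIMESTAMPDIFF expression does on its own, it can be run against a few of the sample log_date values from the question. This is only an illustrative check; the derived table v and the alias log_hour are made up for the example.

```sql
-- Truncate a few sample values to the hour without converting to a string.
SELECT
    v.log_date,
    TIMESTAMPADD(HOUR,
        TIMESTAMPDIFF(HOUR, '2015-01-01 00:00:00', v.log_date),
        '2015-01-01 00:00:00') AS log_hour
FROM (
    SELECT '2015-05-01 00:31:00' AS log_date
    UNION ALL SELECT '2015-05-01 04:19:00'
    UNION ALL SELECT '2015-05-01 07:35:00'
) AS v;
```

Each row should come back with the minutes and seconds zeroed, for example 2015-05-01 07:35:00 becoming 2015-05-01 07:00:00.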
Even if the server uses the indexes on date_hour and my_data to find the necessary rows, it still has to join on the result of a function, and with 1M rows that may be difficult. Most likely it would have to store 1M results of the function in a temporary table, sort it, and then join. This kind of sort is often expensive, especially when it is not done in memory (with 1M rows it is quite likely to be done on disk).
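Whether the server actually behaves this way is best checked rather than assumed. A rough sketch: create the index that Option 1 relies on (the index name here is hypothetical) and ask MySQL for the execution plan of the join.

```sql
-- Index on the raw timestamp, used to find one day's rows in my_data.
CREATE INDEX idx_my_data_log_date ON my_data (log_date);

-- Show the plan instead of running the query.
EXPLAIN
SELECT
    date_hour.hour,
    COUNT(my_data.log_date) AS count
FROM date_hour
LEFT JOIN my_data
    ON date_hour.hour = TIMESTAMPADD(HOUR,
           TIMESTAMPDIFF(HOUR, '2015-01-01 00:00:00', my_data.log_date),
           '2015-01-01 00:00:00')
   AND my_data.log_date >= '2015-05-01 00:00:00'
   AND my_data.log_date <  '2015-05-02 00:00:00'
WHERE date_hour.hour >= '2015-05-01 00:00:00'
  AND date_hour.hour <  '2015-05-02 00:00:00'
GROUP BY date_hour.hour
ORDER BY date_hour.hour;
```

In the output, the key column shows which index (if any) is chosen for each table, and the Extra column mentions things like "Using temporary" or "Using filesort" when the server needs the extra work described above. The exact plan depends on the MySQL version and the data.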
Option 2
To optimize this further and avoid manipulating the datetime on the fly, I would consider adding a persistent column log_hour to the my_data table. It would be populated together with the main column log_date and would contain the log_date value without minutes and seconds. You can think of it as pre-calculating or caching. Once you have an index on log_hour, the server should be able to efficiently both find the rows and join them. The query becomes trivial: it doesn't use the log_date column at all, only log_hour:
SELECT
date_hour.hour,
COUNT(my_data.log_hour) AS count
FROM
date_hour
LEFT JOIN my_data ON
date_hour.hour = my_data.log_hour AND
my_data.log_hour >= '2015-05-01 00:00:00' AND
my_data.log_hour < '2015-05-02 00:00:00'
WHERE
date_hour.hour >= '2015-05-01 00:00:00' AND
date_hour.hour < '2015-05-02 00:00:00'
GROUP BY
date_hour.hour
ORDER BY
date_hour.hour
;
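Adding and back-filling such a column could look roughly like this. The column type and the index name are assumptions, and on a table with tens of millions of rows the UPDATE would probably need to be run in batches.

```sql
-- Add the pre-calculated hour column (nullable so existing rows stay valid).
ALTER TABLE my_data ADD COLUMN log_hour DATETIME NULL;

-- Back-fill existing rows using the same truncation expression as Option 1.
UPDATE my_data
SET log_hour = TIMESTAMPADD(HOUR,
        TIMESTAMPDIFF(HOUR, '2015-01-01 00:00:00', log_date),
        '2015-01-01 00:00:00')
WHERE log_hour IS NULL;

-- Index the new column so both the range lookup and the join can use it.
CREATE INDEX idx_my_data_log_hour ON my_data (log_hour);
```

New rows then need log_hour set by whatever code inserts log_date. On MySQL 5.7 or later, a stored generated column might serve the same purpose, though that is not part of the original answer.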