
How can I select hourly counts from a table, including missing hours?

I'm looking to gather counts by hour. But not every hour is represented in my table.

To make sure the data always includes empty hours, I built an hour table that has hours in datetime from 2000-2037. I figure I can LEFT JOIN data tables to this table to keep track of missing hours. But I need help doing so.

Table: date_hour:

`hour`
2000-01-01 00:00:00
2000-01-01 01:00:00
...
2036-12-31 23:00:00

Table my_data:

log_date               field1
2015-05-01 00:31:00    1000
2015-05-01 04:19:00    2000    
2015-05-01 05:19:00    1000
2015-05-01 07:19:00    100
2015-05-01 07:35:00    6000

Desired result:

hour                   count
2015-05-01 00:00:00    1
2015-05-01 01:00:00    0
2015-05-01 02:00:00    0
2015-05-01 03:00:00    0
2015-05-01 04:00:00    1
2015-05-01 05:00:00    1
2015-05-01 06:00:00    0
2015-05-01 07:00:00    2

MySQL attempt:

SELECT
    dh.hour,
    COUNT(md.*) AS count
FROM
    date_hour dh
    LEFT JOIN my_data md ON dh.hour = ????md.log_date????
WHERE
        dh.hour >= '2015-05-01'
    AND dh.hour <  '2015-05-02'
GROUP BY
    dh.hour
ORDER BY
    dh.hour;

What's the most efficient way to accomplish these counts? Assume each day has 100k-1MM records, with the goal of measuring at least 30 days of data at a time.

asked May 09 '15 03:05 by Ryan


2 Answers

You can use DATE_FORMAT to strip off the minutes and seconds, like this:

query

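SELECT
    dh.hour,
    -- COUNT(md.*) is not valid MySQL; counting a my_data column skips the
    -- NULL rows that the LEFT JOIN produces for empty hours, giving 0
    COUNT(md.log_date) AS count
FROM
    date_hour dh LEFT JOIN my_data md 
    ON dh.hour = DATE_FORMAT(md.log_date, '%Y-%m-%d %H:00:00')
WHERE
        dh.hour >= '2015-05-01'
    AND dh.hour <  '2015-05-02'
GROUP BY
    dh.hour
ORDER BY
    dh.hour
;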

output

+------------------------+-----------+
|          hour          |   count   |
+------------------------+-----------+
| 2015-05-01 00:00:00    | 1         |
| 2015-05-01 01:00:00    | 0         |
| 2015-05-01 02:00:00    | 0         |
| 2015-05-01 03:00:00    | 0         |
| 2015-05-01 04:00:00    | 1         |
| 2015-05-01 05:00:00    | 1         |
| 2015-05-01 06:00:00    | 0         |
| 2015-05-01 07:00:00    | 2         |
| ... trailing hours ... | allzeroes |
+------------------------+-----------+

Every hour from 2015-05-01 08:00:00 onward is zero, because my_data has no rows for those hours.
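To see what the join expression does to a single value, a quick check like the following could be run on its own. The literal timestamp is taken from the sample data; the alias truncated_hour is only illustrative.

```sql
-- Truncate one sample timestamp to the start of its hour.
-- DATE_FORMAT returns a string; here that string is '2015-05-01 07:00:00'.
SELECT DATE_FORMAT('2015-05-01 07:35:00', '%Y-%m-%d %H:00:00') AS truncated_hour;
```

Both 07:19:00 and 07:35:00 from the sample data map to the same 07:00:00 value, which is why that hour ends up with a count of 2.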

answered Oct 06 '22 07:10 by amdixon


If you LEFT JOIN on the result of a function such as DATE_FORMAT, the query produces correct results, but it may be much slower than it needs to be. If the performance of the simple approach shown in the answer by @amdixon is acceptable, just use it.

However, there are a few things you can do to make it faster. You may want to consider them once your table grows to 30M rows (30 days, 1M rows per day).

It goes without saying that the date_hour table must have an index (ideally the primary key) on the hour column. This helps the server quickly find the few rows for a particular day when you use a search condition like this:

WHERE
        date_hour.hour >= '2015-05-01 00:00:00'
    AND date_hour.hour <  '2015-05-02 00:00:00'
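As a rough sketch, the key could be created and the lookup inspected as follows. This assumes date_hour has no primary key yet and that its hour values are unique; the backticks around the column name are just a precaution.

```sql
-- Assumption: date_hour currently has no primary key and `hour` is unique.
ALTER TABLE date_hour ADD PRIMARY KEY (`hour`);

-- Ask the optimizer how it plans to find one day's worth of hours.
EXPLAIN
SELECT `hour`
FROM date_hour
WHERE `hour` >= '2015-05-01 00:00:00'
  AND `hour` <  '2015-05-02 00:00:00';
```

If the key is being used, the EXPLAIN output would be expected to report a range access on the primary key rather than a scan of the whole table.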

Another important thing to keep in mind: if you have 1M rows for a given day and you need to calculate counts for that day, the server has to read at least those 1M rows. You can't avoid that. Reading 1M rows will not be fast, but if the whole table has 30M rows, it is obviously better to read just 1M rows than the whole table.

So the server should be able to find the rows for a particular day efficiently (that is, there should be an index on log_date). Any query that strips minutes and seconds from the log_date column on the fly while joining cannot use that index to locate rows, so the server would have to scan the whole my_data table.
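One way to check this on your own data is to compare the plans for the two forms of the same condition. This is only a sketch: it assumes an index on my_data.log_date already exists, and the exact EXPLAIN output depends on the MySQL version and the data.

```sql
-- Assumption: an index on my_data.log_date already exists.

-- Function applied to the column: the index cannot be used to seek
-- directly to the matching rows.
EXPLAIN
SELECT COUNT(*)
FROM my_data
WHERE DATE_FORMAT(log_date, '%Y-%m-%d %H:00:00') = '2015-05-01 07:00:00';

-- Same hour expressed as a range on the bare column: eligible for an
-- index range scan.
EXPLAIN
SELECT COUNT(*)
FROM my_data
WHERE log_date >= '2015-05-01 07:00:00'
  AND log_date <  '2015-05-01 08:00:00';
```

Comparing the estimated row counts of the two plans should make the difference visible.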

Option 1

Add an index on my_data.log_date and an explicit range filter on my_data.log_date. The filter belongs in the ON clause of the LEFT JOIN, not in the WHERE clause: a WHERE condition on my_data would reject the NULL rows that the LEFT JOIN produces for hours without data, and those hours would disappear from the result. Placed in the ON clause, the filter does not change the results, but hopefully gives the server a good hint to use the index on my_data.log_date to find the necessary rows and avoid a full scan. Maybe MySQL is smart enough, when you convert a datetime to a string with DATE_FORMAT, not to convert date_hour.hour to a string as well for the comparison (which would negate the benefit of the index on date_hour.hour). Maybe not. I prefer the following method, which removes minutes and seconds from a datetime without converting it to a string.

TIMESTAMPADD(HOUR,
    TIMESTAMPDIFF(HOUR,'2015-01-01 00:00:00',DateTimeValue),
    '2015-01-01 00:00:00')

We can use any constant instead of '2015-01-01 00:00:00' as long as it has no minutes and seconds. The same method can truncate a datetime to any other boundary (minute, day, week, month, year), provided the constant itself falls on that boundary.

SELECT
    date_hour.hour,
    COUNT(my_data.log_date) AS count
FROM
    date_hour
    LEFT JOIN my_data ON 
        date_hour.hour = TIMESTAMPADD(HOUR, TIMESTAMPDIFF(HOUR,'2015-01-01 00:00:00',my_data.log_date), '2015-01-01 00:00:00')
        -- range filter on my_data stays in ON so that empty hours are kept
        AND my_data.log_date >= '2015-05-01 00:00:00'
        AND my_data.log_date <  '2015-05-02 00:00:00'
WHERE
    date_hour.hour   >= '2015-05-01 00:00:00' AND
    date_hour.hour   <  '2015-05-02 00:00:00'
GROUP BY
    date_hour.hour
ORDER BY
    date_hour.hour
;
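The truncation expression used in the join above can be tried on its own before it goes into a larger query. The sketch below applies it to one value from the sample data at three granularities. The aliases are illustrative, and the anchor is the same arbitrary constant as above, which happens to fall on an hour, day and month boundary.

```sql
-- Truncate a sample value to the hour, the day and the month.
-- The anchor '2015-01-01 00:00:00' must lie on each boundary being
-- truncated to, and should be no later than the earliest value in the data.
SELECT
    TIMESTAMPADD(HOUR,
        TIMESTAMPDIFF(HOUR,  '2015-01-01 00:00:00', '2015-05-01 07:35:00'),
        '2015-01-01 00:00:00') AS to_hour,
    TIMESTAMPADD(DAY,
        TIMESTAMPDIFF(DAY,   '2015-01-01 00:00:00', '2015-05-01 07:35:00'),
        '2015-01-01 00:00:00') AS to_day,
    TIMESTAMPADD(MONTH,
        TIMESTAMPDIFF(MONTH, '2015-01-01 00:00:00', '2015-05-01 07:35:00'),
        '2015-01-01 00:00:00') AS to_month;
```

For this value, to_hour should come back as 07:00:00 on 2015-05-01, and to_day and to_month as midnight on 2015-05-01. For values earlier than the anchor the difference is negative and, as far as I know, is truncated toward zero, which would round up instead of down. That is why an anchor before all of the data is the safer choice.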

Even if the server uses the indexes on date_hour and my_data to find the necessary rows, it still has to join on the result of the function, and with 1M rows that may be difficult. Most likely it would have to store 1M function results in a temporary table, sort it and then join. These kinds of sorts are often expensive, especially when they are not done in memory (with 1M rows the sort is quite likely to be done on disk).
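One variant that may reduce the join cost, and which is not part of the options described here, is to aggregate my_data per hour first and only then join the handful of resulting rows to date_hour. The sketch below is untested against real data volumes; the aliases hour_start, cnt and c are made up for illustration.

```sql
SELECT
    dh.hour,
    COALESCE(c.cnt, 0) AS count          -- hours without data get 0
FROM
    date_hour dh
    LEFT JOIN (
        -- At most 24 rows per day come out of this derived table.
        SELECT
            CAST(TIMESTAMPADD(HOUR,
                     TIMESTAMPDIFF(HOUR, '2015-01-01 00:00:00', log_date),
                     '2015-01-01 00:00:00') AS DATETIME) AS hour_start,
            COUNT(*) AS cnt
        FROM my_data
        WHERE log_date >= '2015-05-01 00:00:00'
          AND log_date <  '2015-05-02 00:00:00'
        GROUP BY hour_start
    ) c ON dh.hour = c.hour_start
WHERE
        dh.hour >= '2015-05-01 00:00:00'
    AND dh.hour <  '2015-05-02 00:00:00'
ORDER BY
    dh.hour;
```

Here the range filter sits in the WHERE clause of the inner query, which is safe because it only restricts my_data before the LEFT JOIN takes place. Whether this is actually faster would need to be measured.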

Option 2

To optimize this further and avoid manipulating the datetime on the fly, I would consider adding a persisted column log_hour to the my_data table. It would be populated together with the main log_date column and would contain the log_date value without minutes and seconds. You can think of it as pre-calculating or caching. Once there is an index on log_hour, the server should be able to both find the rows efficiently and join them. The query becomes trivial and doesn't use the log_date column at all, only log_hour:

SELECT
    date_hour.hour,
    COUNT(my_data.log_hour) AS count
FROM
    date_hour
    LEFT JOIN my_data ON
        date_hour.hour = my_data.log_hour
        -- range filter on my_data stays in ON: in the WHERE clause it would
        -- reject the NULL rows of empty hours and drop them from the result
        AND my_data.log_hour >= '2015-05-01 00:00:00'
        AND my_data.log_hour <  '2015-05-02 00:00:00'
WHERE
    date_hour.hour   >= '2015-05-01 00:00:00' AND
    date_hour.hour   <  '2015-05-02 00:00:00'
GROUP BY
    date_hour.hour
ORDER BY
    date_hour.hour
;
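The log_hour column that Option 2 relies on could be set up roughly as follows. This is a sketch under assumptions: the index name idx_my_data_log_hour is made up, the column is left nullable so that existing inserts keep working, and new rows are assumed to get log_hour from the application (or a trigger) at insert time.

```sql
-- 1. Add the column; nullable so the ALTER needs no default value.
ALTER TABLE my_data ADD COLUMN log_hour DATETIME NULL;

-- 2. Backfill rows that do not have a value yet.
--    On a table with tens of millions of rows, running this in batches
--    (for example one day of log_date at a time) may be gentler.
UPDATE my_data
SET log_hour = TIMESTAMPADD(HOUR,
        TIMESTAMPDIFF(HOUR, '2015-01-01 00:00:00', log_date),
        '2015-01-01 00:00:00')
WHERE log_hour IS NULL;

-- 3. Index the new column so the range filter and the join can both use it.
CREATE INDEX idx_my_data_log_hour ON my_data (log_hour);
```

The price is extra storage and the need to keep log_hour in sync with log_date on every insert or update.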

answered Oct 06 '22 09:10 by Vladimir Baranov