Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find the one hour period with the most datapoints?

I have a database table with hundreds of thousands of forum posts, and I would like to find out what hour-long period contains the most number of posts.

I could crawl forward one minute at a time, keeping an array of timestamps and keeping track of what hour had the most in it, but I feel like there is a much better way to do this. I will be running this operation on a year of posts so checking every minute in a year seems pretty awful.

Ideally there would be a way to do this inside a single database query.

like image 587
Allie the Icon Avatar asked Feb 03 '09 19:02

Allie the Icon


3 Answers

Given a table filled with every minute in the year you are interested in Minutes and a table Posts with a Time column:

select top 1 minutes.time, count (posts.time)
from Minutes
   left join posts on posts.time >= minutes.time AND posts.time < dateadd(hour, 1, Minutes.Time)
group by minutes.time
order by count (posts.time) desc

To solve generating the minutes table, you can use a function like ufn_GenerateIntegers. Then the function becomes

select top 5 minutes.time, count (posts.time)
from (select dateadd(minute, IntValue, '2008-01-01') as Time from ufn_GenerateIntegers(525600)) Minutes
   left join posts on posts.time >= minutes.time AND posts.time < dateadd(hour, 1, Minutes.Time)
group by minutes.time
order by count(posts.time) desc

I just did a test run with about 5000 random posts and it took 16 seconds on my machine. So, not trivial, but not rediculous for the occasional one-off query. Fortunately, this is a data-point you can calculate one a day or even once a month and cache if you want to display the value frequently.

Take a look at lassevk's improvement.

like image 199
Eclipse Avatar answered Oct 18 '22 01:10

Eclipse


Binning will work if you want to look at intervals such as 10:00 - 11:00. However if you had a sudden flurry of interest from 10:30 - 11:30 then it will be split across two bins, and hence may be hidden by an smaller number of hits that happened to fit entirely within a single clock hour.

The only way to avoid this problem is to generate a list sorted by time and step through it. Something like this:

max = 0; maxTime = 0
for each $item in the list:
   push $item onto queue
   while head of queue is more than an hour before $item
      drop queue head.
   if queue.count > max then max = queue.count; maxTime = $item.time

That way you only need to hold a 1 hour window in memory rather than the whole list.

like image 32
Paul Johnson Avatar answered Oct 18 '22 03:10

Paul Johnson


Treat the timestamp of every post as the start of such an hour, and count all other posts that fall within that hour, including the post that started it. Sort the resulting hours in descending order by the number of posts in each of them.

Having done that, you'll find the topmost single "hour" that has the most posts in it, but this period of time might not be exactly one hour long, it might be shorter (but never longer).

To get a "prettier" period, you can calculate how long it really is, divide by two, and adjust the start of the period back by that amount and the end forward, this will "center" the posts inside the hour. This adjustment is guaranteed to not include any new posts, so the count is still valid. If posts are close enough to suddenly be included in the period after you have expanded it to one hour, then an earlier point would've had "the most posts" in it instead of the one you picked.

If this is an SQL question, you can reuse the SQL that Josh posted here, just replace the Minutes table with another link to your posts table.


Another method you can use is to use a sliding window.

First sort all the posts according to the timestamp. Keep track of posts using a list, a linked list could be used for this.

Now, for each post, add it to the end of the list. Then, for each post from the start of the list, if that post is more than one hour before the post you just added, remove it from the list.

After doing that 2-step operation for a single new post in the list, check if the number of posts in the list is more than a previous maximum, and if it is, either make a copy of the list or at least store the post you just added.

After you're finished, you've got the "copy of the list" with the most posts in an hour, or you got the post that is the end of a 1-hour window that contains the most posts.

Pseudo-code:

initialize posts-window-list to empty list
for each post in sorted-posts-list:
    add post to end of posts-window-list
    for each other-post from start of posts-window-list:
        if other-post is more than one hour older than post, remove it
        otherwise, end this inner loop
    if number of posts in list is more than previous maximum:
        make copy of list, this is the new maximum
like image 2
Lasse V. Karlsen Avatar answered Oct 18 '22 01:10

Lasse V. Karlsen