Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

mysql select number of rows between time span

Tags:

sql

mysql

I'm trying to get the total number of rows between a specific amout of time or time span. Basically, let's say the following table:

CREATE TABLE IF NOT EXISTS `downloads` (
`id` int(7) NOT NULL AUTO_INCREMENT,
`stuff_id` int(7) NOT NULL,
`user_id` int(7) NOT NULL,
`dl_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1;

And this table is populated each time someone downloads something.

So what I really need is to get a list of users (user_id) that have made more than for example 100 downloads in a periods of time of, for example 24 hours. Not in the last 24 hours, but IN that exact period of time even if it has been during christmas last year =)

Any ideas at all ?!

like image 305
eduardev Avatar asked Jan 24 '11 17:01

eduardev


2 Answers

OK, I realise I'm a bit late, but I wanted to post my answer anyway :-)

What you require can be done using a subquery, but this might take ages to complete on a large table...

Thinking about the question I came to two different approaches.

One of them has already been dealt with in the other answers, it works by starting at a specific point in time, looking at the interval that begins at this time and then looking at the interval of equal duration that immediately follows. This leads to clear, understandable results and is probably what would be required (e.g. user must not exceed 100 downloads per calender day). This however would completely miss situations in which a user does 99 downloads during the hour before midnight and another 99 in the first hour of the new day.

So if the required result is more of a "top ten downloaders list", then this is the other approach. The results here may not be as understandable at first glance, because one single download can count towards multiple intervals. This is because the intervals will (and need to) overlap.

Here's my setup. I've created the table from your statement and added two indexes:

CREATE INDEX downloads_timestamp on downloads (dl_date);
CREATE INDEX downloads_user_id on downloads (user_id);

The data I've inserted into the table:

SELECT * FROM downloads;
+----+----------+---------+---------------------+
| id | stuff_id | user_id | dl_date             |
+----+----------+---------+---------------------+
|  1 |        1 |       1 | 2011-01-24 09:00:00 |
|  2 |        1 |       1 | 2011-01-24 09:30:00 |
|  3 |        1 |       1 | 2011-01-24 09:35:00 |
|  4 |        1 |       1 | 2011-01-24 10:00:00 |
|  5 |        1 |       1 | 2011-01-24 11:00:00 |
|  6 |        1 |       1 | 2011-01-24 11:15:00 |
|  7 |        1 |       1 | 2011-01-25 09:15:00 |
|  8 |        1 |       1 | 2011-01-25 09:30:00 |
|  9 |        1 |       1 | 2011-01-25 09:45:00 |
| 10 |        1 |       2 | 2011-01-24 08:00:00 |
| 11 |        1 |       2 | 2011-01-24 12:00:00 |
| 12 |        1 |       2 | 2011-01-24 12:01:00 |
| 13 |        1 |       2 | 2011-01-24 12:02:00 |
| 14 |        1 |       2 | 2011-01-24 12:03:00 |
| 15 |        1 |       2 | 2011-01-24 12:00:00 |
| 16 |        1 |       2 | 2011-01-24 12:04:00 |
| 17 |        1 |       2 | 2011-01-24 12:05:00 |
| 18 |        1 |       2 | 2011-01-24 12:06:00 |
| 19 |        1 |       2 | 2011-01-24 12:07:00 |
| 20 |        1 |       2 | 2011-01-24 12:08:00 |
| 21 |        1 |       2 | 2011-01-24 12:09:00 |
| 22 |        1 |       2 | 2011-01-24 12:10:00 |
| 23 |        1 |       2 | 2011-01-25 14:00:00 |
| 24 |        1 |       2 | 2011-01-25 14:12:00 |
| 25 |        1 |       2 | 2011-01-25 14:25:00 |
+----+----------+---------+---------------------+
25 rows in set (0.00 sec)

As you can see, all downloads occured either yesterday or today and were executed by two different users.

Now, what we have to mind is the following: There is (mathematically) an infinite number of 24 hour intervals (or intervals of any other duration) between '2011-01-24 0:00' and '2011-01-25 23:59:59'. But as the server's precision is one second, this boils down to 86,400 intervals:

First interval:  2011-01-24 0:00:00 -> 2011-01-25 0:00:00
Second interval: 2011-01-24 0:00:01 -> 2011-01-25 0:00:01
Third interval: 2011-01-24 0:00:02 -> 2011-01-25 0:00:02
   .
   .
   .
86400th interval: 2011-01-24 23:59:59 -> 2011-01-25 23:59:59

So we could use a loop to iterate over all these intervals and calculate the number of downloads per user and per interval. Of course, not all intervals are of the same interest to us, so we can skip some of them by using the timestamps in the table as "beginning of interval".

This is what the following query does. It uses every download timestamp in the table as "start of interval", adds the interval duration and then queries the number of downloads per user during this interval.

SET @duration = '24:00:00';
SET @limit = 5;
SELECT * FROM 
    (SELECT t1.user_id, 
            t1.dl_date startOfPeriod, 
            ADDTIME(t1.dl_date,@duration) endOfPeriod, 
           (SELECT COUNT(1) 
            FROM downloads t2 
            WHERE t1.user_id = t2.user_id 
            AND t1.dl_date <= t2.dl_date 
            AND ADDTIME(t1.dl_date,@duration) >= t2.dl_date) count
     FROM downloads t1) t3 
WHERE count > @limit;

Here's the result:

+---------+---------------------+---------------------+-------+
| user_id | startOfPeriod       | endOfPeriod         | count |
+---------+---------------------+---------------------+-------+
|       1 | 2011-01-24 09:00:00 | 2011-01-25 09:00:00 |     6 |
|       1 | 2011-01-24 09:30:00 | 2011-01-25 09:30:00 |     7 |
|       1 | 2011-01-24 09:35:00 | 2011-01-25 09:35:00 |     6 |
|       1 | 2011-01-24 10:00:00 | 2011-01-25 10:00:00 |     6 |
|       2 | 2011-01-24 08:00:00 | 2011-01-25 08:00:00 |    13 |
|       2 | 2011-01-24 12:00:00 | 2011-01-25 12:00:00 |    12 |
|       2 | 2011-01-24 12:01:00 | 2011-01-25 12:01:00 |    10 |
|       2 | 2011-01-24 12:02:00 | 2011-01-25 12:02:00 |     9 |
|       2 | 2011-01-24 12:03:00 | 2011-01-25 12:03:00 |     8 |
|       2 | 2011-01-24 12:00:00 | 2011-01-25 12:00:00 |    12 |
|       2 | 2011-01-24 12:04:00 | 2011-01-25 12:04:00 |     7 |
|       2 | 2011-01-24 12:05:00 | 2011-01-25 12:05:00 |     6 |
+---------+---------------------+---------------------+-------+
12 rows in set (0.00 sec)
like image 142
Patrick Echterbruch Avatar answered Sep 17 '22 23:09

Patrick Echterbruch


This returns a list of user_id which have made more than 100 downloads during any period of 1 day:

SELECT user_id, count(user_id) as downloads_count, DATE(dl_date) 
FROM downloads
GROUP BY user_id, DATE(dl_date)
HAVING count(user_id) > 100;
like image 29
Arnaud Le Blanc Avatar answered Sep 18 '22 23:09

Arnaud Le Blanc