Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

most active time of day based on start and end time

Tags:

mysql

I'm logging statistics of the gamers in my community. For both their online and in-game states I'm registering when they "begin" and when they "end". In order to show the most active day and hour of the day I'd like to use an SQL statement that measures the most active moments based on the "begin" and "end" datetime values.

Looking at SQL - select most 'active' time from db I can see similarities, but I need to also include the moments between the start and end time.

Perhaps the easiest way is to write a cron that does the calculations, but I hope this question might teach me how to address this issue in SQL instead.

I've been searching for an SQL statement that allows to create a datetime period and use that to substract single hours and days. But to no avail.

--- update

As I'm thinking more about this, I'm wondering whether it might be wise to run 24 queries based on each hour of the day (for most active hour) and several queries for the most active day. But that seems like a waste of performance. But this solution might make a query possible like:

SELECT COUNT(`userID`), DATE_FORMAT("%H",started) AS starthour, 
       DATE_FORMAT("%H",ended) AS endhour 
       FROM gameactivity 
       WHERE starthour >= $hour 
             AND endhour <= $hour GROUP BY `userID`

($hour is added for example purposes, of course I'm using PDO. Columns are also just for example purposes, whatever you think is easy for you to use in explaining that is identifiable as start and end is ok with me)

Additional information; PHP 5.5+, PDO, MySQL 5+ Table layout for ingame would be: gameactivity: activityid, userid, gameid, started, ended

DDL:

CREATE TABLE IF NOT EXISTS `steamonlineactivity` (
  `activityID` int(13) NOT NULL AUTO_INCREMENT,
  `userID` varchar(255) NOT NULL,
  `online` datetime DEFAULT NULL,
  `offline` datetime DEFAULT NULL,
  PRIMARY KEY (`activityID`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;
like image 763
Luceos Avatar asked Dec 11 '13 09:12

Luceos


3 Answers

If I understood your requirements correctly, if this graph represents user activity:

       Day 
       12/1 12/2 12/3 12/4 ...
Hour 0  xx    x    x   xx
     1   x   xx        xx
     2 xxx    x    x   xx
     3   x              x
     4        x         x
     5   x              x
     6                  x
   ...

You want to know that 02:00 is the time of the day with the highest average activity (a row with 7 x), and 12/4 was most active day (a column with 10 x). Note that this doesn't imply that 02:00 of 12/4 was the most active hour ever, as you can see in the example. If this is not what you want please clarify with concrete examples of input and desired result.

We make a couple assumptions:

  • An activity record can start on one date and finish on the next one. For instance: online 2013-12-02 23:35, offline 2013-12-03 00:13.
  • No activity record has a duration longer than 23 hours, or the number of such records is negligible.

And we need to define what does 'activity' mean. I picked the criteria that were easier to compute in each case. Both can be made more accurate if needed, at the cost of having more complex queries.

  • The most active time of day will be the hour with which more activity records overlap. Note that if a user starts and stops more than once during the hour it will be counted more than once.
  • The most active day will be the one for which there were more unique users that were active at any time of the day.

For the most active time of day we'll use a small auxiliary table holding the 24 possible hours. It can also be generated and joined on the fly with the techniques described in other answers.

CREATE TABLE hour ( hour tinyint not null, primary key(hour) );
INSERT hour (hour)
VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)
     , (11), (12), (13), (14), (15), (16), (17), (18), (19), (20)
     , (21), (22), (23);

Then the following queries give the required results:

SELECT hour, count(*) AS activity
  FROM steamonlineactivity, hour
 WHERE ( hour BETWEEN hour(online) AND hour(offline)
      OR hour(online) BETWEEN hour(offline) AND hour
      OR hour(offline) BETWEEN hour AND hour(online) )
 GROUP BY hour
 ORDER BY activity DESC;

SELECT date, count(DISTINCT userID) AS activity
  FROM ( 
       SELECT userID, date(online) AS date
         FROM steamonlineactivity
        UNION
       SELECT userID, date(offline) AS date
         FROM steamonlineactivity
   ) AS x
 GROUP BY date
 ORDER BY activity DESC;
like image 132
rsanchez Avatar answered Nov 15 '22 21:11

rsanchez


You need a sequence to get values for hours where there was no activity (e.g. hours where nobody starting or finishing, but there were people on-line who had started but had not finished in that time). Unfortunately there is no nice way to create a sequence in MySQL so you will have to create the sequence manually;

CREATE TABLE `hour_sequence` (
  `ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `hour` datetime NOT NULL,
  KEY (`hour`),
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

# this is not great
INSERT INTO `hour_sequence` (`hour`) VALUES
("2013-12-01 00:00:00"),
("2013-12-01 01:00:00"),
("2013-12-01 02:00:00"),
("2013-12-01 03:00:00"),
("2013-12-01 04:00:00"),
("2013-12-01 05:00:00"),
("2013-12-01 06:00:00"),
("2013-12-01 07:00:00"),
("2013-12-01 08:00:00"),
("2013-12-01 09:00:00"),
("2013-12-01 10:00:00"),
("2013-12-01 11:00:00"),
("2013-12-01 12:00:00");

Now create some test data

CREATE TABLE `log_table` (
  `ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `userID` bigint(20) unsigned NOT NULL,
  `started` datetime NOT NULL,
  `finished` datetime NOT NULL,
  KEY (`started`),
  KEY (`finished`),
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET latin1;

INSERT INTO `log_table` (`userID`,`started`,`finished`) VALUES
(1, "2013-12-01 00:00:12", "2013-12-01 02:25:00"),
(2, "2013-12-01 07:25:00", "2013-12-01 08:23:00"),
(1, "2013-12-01 04:25:00", "2013-12-01 07:23:00");

Now the query - for every hour we keep a tally (accumulation/running total/integral etc) of how many people started a session hour-on-hour

  SELECT
   HS.hour as period_starting,
   COUNT(LT.userID) AS starts
  FROM `hour_sequence` HS
   LEFT JOIN `log_table` LT ON HS.hour > LT.started
  GROUP BY
   HS.hour

And also how many people went off-line likewise

  SELECT
   HS.hour as period_starting,
   COUNT(LT.userID) AS finishes
  FROM `hour_sequence` HS
   LEFT JOIN `log_table` LT ON HS.hour > LT.finished
  GROUP BY
   HS.hour

By subtracting the accumulation of people that had gone off-line at a point in time from the accumulation of people that have come on-line at that point in time we get the number of people who were on-line at that point in time (presuming there were zero people on-line when the data starts, of course).

SELECT
 starts.period_starting,
 starts.starts as users_started,
 finishes.finishes as users_finished,
 starts.starts - finishes.finishes as users_online

FROM
 (
  SELECT
   HS.hour as period_starting,
   COUNT(LT.userID) AS starts
  FROM `hour_sequence` HS
   LEFT JOIN `log_table` LT ON HS.hour > LT.started
  GROUP BY
   HS.hour
 ) starts

 LEFT JOIN (
  SELECT
   HS.hour as period_starting,
   COUNT(LT.userID) AS finishes
  FROM `hour_sequence` HS
   LEFT JOIN `log_table` LT ON HS.hour > LT.finished
  GROUP BY
   HS.hour
 ) finishes ON starts.period_starting = finishes.period_starting;

Now a few caveats. First of all you will need a process to keep your sequence table populated with the hourly timestamps as time progresses. Additionally the accumulators do not scale well with large amounts of log data due to the tenuous join - it would be wise to constrain access to the log table by timestamp in both the starts and finishes subquery, and the sequence table while you are at it.

  SELECT
   HS.hour as period_starting,
   COUNT(LT.userID) AS finishes
  FROM `hour_sequence` HS
   LEFT JOIN `log_table` LT ON HS.hour > LT.finished
  WHERE
   LT.finished BETWEEN ? AND ? AND HS.hour BETWEEN ? AND ?
  GROUP BY
   HS.hour

If you start constraining your log_table data to specific time ranges bear in mind you will have an offset issue if, at the point you start looking at the log data, there were already people on-line. If there were 1000 people on-line at the point where you start looking at your log data then you threw them all off the server from the query it would look like we went from 0 people on-line to -1000 people on-line!

like image 20
abasterfield Avatar answered Nov 15 '22 20:11

abasterfield


@rsanchez had an amazing answer, but the query for most active time of day has a weird behaviour when handling session times that started and ended on the same hour (a short session). The query seems to calculate them to last for 24 hours.

With trial and error I corrected his query from that part to be following

SELECT hour, count(*) AS activity
FROM steamonlineactivity, hour
WHERE ( hour >= HOUR(online) AND hour <= HOUR(offline)
  OR HOUR(online) > HOUR(offline) AND HOUR(online) <= hour
  OR HOUR(offline) >= hour AND HOUR(offline) < HOUR(online) )
GROUP BY hour
ORDER BY activity DESC;

So with following structure:

CREATE TABLE hour ( hour tinyint not null, primary key(hour) );
INSERT hour (hour)
VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)
 , (11), (12), (13), (14), (15), (16), (17), (18), (19), (20)
 , (21), (22), (23);

CREATE TABLE `steamonlineactivity` (
  `activityID` int(13) NOT NULL AUTO_INCREMENT,
  `userID` varchar(255) NOT NULL,
  `online` datetime DEFAULT NULL,
  `offline` datetime DEFAULT NULL,
  PRIMARY KEY (`activityID`)
);

INSERT INTO `steamonlineactivity` (`activityID`, `userID`, `online`, `offline`) VALUES
(1, '1',    '2014-01-01 16:01:00',  '2014-01-01 19:01:00'),
(2, '2',    '2014-01-02 16:01:00',  '2014-01-02 19:01:00'),
(3, '3',    '2014-01-01 22:01:00',  '2014-01-02 02:01:00'),
(4, '4',    '2014-01-01 16:01:00',  '2014-01-01 16:05:00');

The top query to get the most active times output following:

+------+----------+
| hour | activity |
+------+----------+
|   16 |        3 |
|   17 |        2 |
|   18 |        2 |
|   19 |        2 |
|   22 |        1 |
|   23 |        1 |
|    0 |        1 |
|    1 |        1 |
|    2 |        1 |
+------+----------+

Instead of the original query which gives following erronous result:

+------+----------+
| hour | activity |
+------+----------+
|   16 |        3 |
|   17 |        3 |
|   18 |        3 |
|   19 |        3 |
|    0 |        2 |
|    1 |        2 |
|    2 |        2 |
|   22 |        2 |
|   23 |        2 |
|   11 |        1 |
|   12 |        1 |
|   13 |        1 |
|   14 |        1 |
|   15 |        1 |
|    3 |        1 |
|    4 |        1 |
|   20 |        1 |
|    5 |        1 |
|   21 |        1 |
|    6 |        1 |
|    7 |        1 |
|    8 |        1 |
|    9 |        1 |
|   10 |        1 |
+------+----------+
like image 1
Zachu Avatar answered Nov 15 '22 21:11

Zachu