Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL: group by consecutive days and count groups

I have a database table which holds each user's checkins in cities. I need to know how many days a user has been in a city, and then, how many visits a user has made to a city (a visit consists of consecutive days spent in a city).

So, consider I have the following table (simplified, containing only the DATETIMEs - same user and city):

      datetime
-------------------
2011-06-30 12:11:46
2011-07-01 13:16:34
2011-07-01 15:22:45
2011-07-01 22:35:00
2011-07-02 13:45:12
2011-08-01 00:11:45
2011-08-05 17:14:34
2011-08-05 18:11:46
2011-08-06 20:22:12

The number of days this user has been to this city would be 6 (30.06, 01.07, 02.07, 01.08, 05.08, 06.08).

I thought of doing this using SELECT COUNT(id) FROM table GROUP BY DATE(datetime)

Then, for the number of visits this user has made to this city, the query should return 3 (30.06-02.07, 01.08, 05.08-06.08).

The problem is that I have no idea how shall I build this query.

Any help would be highly appreciated!

like image 308
linkyndy Avatar asked Aug 17 '11 13:08

linkyndy


3 Answers

You can find the first day of each visit by finding checkins where there was no checkin the day before.

select count(distinct date(start_of_visit.datetime))
from checkin start_of_visit
left join checkin previous_day
    on start_of_visit.user = previous_day.user
    and start_of_visit.city = previous_day.city
    and date(start_of_visit.datetime) - interval 1 day = date(previous_day.datetime)
where previous_day.id is null

There are several important parts to this query.

First, each checkin is joined to any checkin from the previous day. But since it's an outer join, if there was no checkin the previous day the right side of the join will have NULL results. The WHERE filtering happens after the join, so it keeps only those checkins from the left side where there are none from the right side. LEFT OUTER JOIN/WHERE IS NULL is really handy for finding where things aren't.

Then it counts distinct checkin dates to make sure it doesn't double-count if the user checked in multiple times on the first day of the visit. (I actually added that part on edit, when I spotted the possible error.)

Edit: I just re-read your proposed query for the first question. Your query would get you the number of checkins on a given date, instead of a count of dates. I think you want something like this instead:

select count(distinct date(datetime))
from checkin
where user='some user' and city='some city'
like image 62
Simon Avatar answered Nov 12 '22 07:11

Simon


Try to apply this code to your task -

CREATE TABLE visits(
  user_id INT(11) NOT NULL,
  dt DATETIME DEFAULT NULL
);

INSERT INTO visits VALUES 
  (1, '2011-06-30 12:11:46'),
  (1, '2011-07-01 13:16:34'),
  (1, '2011-07-01 15:22:45'),
  (1, '2011-07-01 22:35:00'),
  (1, '2011-07-02 13:45:12'),
  (1, '2011-08-01 00:11:45'),
  (1, '2011-08-05 17:14:34'),
  (1, '2011-08-05 18:11:46'),
  (1, '2011-08-06 20:22:12'),
  (2, '2011-08-30 16:13:34'),
  (2, '2011-08-31 16:13:41');


SET @i = 0;
SET @last_dt = NULL;
SET @last_user = NULL;

SELECT v.user_id,
  COUNT(DISTINCT(DATE(dt))) number_of_days,
  MAX(days) number_of_visits
FROM
  (SELECT user_id, dt
        @i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS days,
        @last_dt := DATE(dt),
        @last_user := user_id
   FROM
     visits
   ORDER BY
     user_id, dt
  ) v
GROUP BY
  v.user_id;

----------------
Output:

+---------+----------------+------------------+
| user_id | number_of_days | number_of_visits |
+---------+----------------+------------------+
|       1 |              6 |                3 |
|       2 |              2 |                1 |
+---------+----------------+------------------+

Explanation:

To understand how it works let's check the subquery, here it is.

SET @i = 0;
SET @last_dt = NULL;
SET @last_user = NULL;


SELECT user_id, dt,
        @i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS 

days,
        @last_dt := DATE(dt) lt,
        @last_user := user_id lu
FROM
  visits
ORDER BY
  user_id, dt;

As you see the query returns all rows and performs ranking for the number of visits. This is known ranking method based on variables, note that rows are ordered by user and date fields. This query calculates user visits, and outputs next data set where days column provides rank for the number of visits -

+---------+---------------------+------+------------+----+
| user_id | dt                  | days | lt         | lu |
+---------+---------------------+------+------------+----+
|       1 | 2011-06-30 12:11:46 |    1 | 2011-06-30 |  1 |
|       1 | 2011-07-01 13:16:34 |    1 | 2011-07-01 |  1 |
|       1 | 2011-07-01 15:22:45 |    1 | 2011-07-01 |  1 |
|       1 | 2011-07-01 22:35:00 |    1 | 2011-07-01 |  1 |
|       1 | 2011-07-02 13:45:12 |    1 | 2011-07-02 |  1 |
|       1 | 2011-08-01 00:11:45 |    2 | 2011-08-01 |  1 |
|       1 | 2011-08-05 17:14:34 |    3 | 2011-08-05 |  1 |
|       1 | 2011-08-05 18:11:46 |    3 | 2011-08-05 |  1 |
|       1 | 2011-08-06 20:22:12 |    3 | 2011-08-06 |  1 |
|       2 | 2011-08-30 16:13:34 |    1 | 2011-08-30 |  2 |
|       2 | 2011-08-31 16:13:41 |    1 | 2011-08-31 |  2 |
+---------+---------------------+------+------------+----+

Then we group this data set by user and use aggregate functions: 'COUNT(DISTINCT(DATE(dt)))' - counts the number of days 'MAX(days)' - the number of visits, it is a maximum value for the days field from our subquery.

That is all;)

like image 34
Devart Avatar answered Nov 12 '22 07:11

Devart


As data sample provided by Devart, the inner "PreQuery" works with sql variables. By defaulting the @LUser to a -1 (probable non-existent user ID), the IF() test checks for any difference between last user and current. As soon as a new user, it gets a value of 1... Additionally, if the last date is more than 1 day from the new date of check-in, it gets a value of 1. Then, the subsequent columns reset the @LUser and @LDate to the value of the incoming record just tested against for the next cycle. Then, the outer query just sums them up and counts them for the final correct results per the Devart data set of

User ID    Distinct Visits   Total Days
1           3                 9
2           1                 2

select PreQuery.User_ID,
       sum( PreQuery.NextVisit ) as DistinctVisits,
       count(*) as TotalDays
   from
      (  select v.user_id,
               if( @LUser <> v.User_ID OR @LDate < ( date( v.dt ) - Interval 1 day ), 1, 0 ) as NextVisit,
               @LUser := v.user_id,
               @LDate := date( v.dt )
            from 
               Visits v,
               ( select @LUser := -1, @LDate := date(now()) ) AtVars 
            order by
               v.user_id,
               v.dt  ) PreQuery
    group by 
       PreQuery.User_ID
like image 1
DRapp Avatar answered Nov 12 '22 06:11

DRapp