Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count number of rows that are not within 10 seconds of each other

Tags:

sql

mysql

I track web visitors. I store the IP address as well as the timestamp of the visit.

ip_address    time_stamp
180.2.79.3  1301654105
180.2.79.3  1301654106
180.2.79.3  1301654354
180.2.79.3  1301654356
180.2.79.3  1301654358
180.2.79.3  1301654366
180.2.79.3  1301654368
180.2.79.3  1301654422

I have a query to get total tracks:

SELECT COUNT(*) AS tracks FROM tracking

However, I now want to disregard visits from users that have visited multiple times within 10 seconds of each visit. Since I don't consider this another visit, its still part of the first visit.

When the ip_address is the same, check timestamp and only count those rows that are 10 seconds away from each other.

I am having difficulty in putting this into a SQL query form, I would appreciate any help on this!

like image 771
Abs Avatar asked Apr 01 '11 11:04

Abs


3 Answers

Let me start with this table. I'll use ordinary timestamps so we can easily see what's going on.

180.2.79.3   2011-01-01 08:00:00
180.2.79.3   2011-01-01 08:00:09
180.2.79.3   2011-01-01 08:00:20
180.2.79.3   2011-01-01 08:00:23
180.2.79.3   2011-01-01 08:00:25
180.2.79.3   2011-01-01 08:00:40
180.2.79.4   2011-01-01 08:00:00
180.2.79.4   2011-01-01 08:00:13
180.2.79.4   2011-01-01 08:00:23
180.2.79.4   2011-01-01 08:00:25
180.2.79.4   2011-01-01 08:00:27
180.2.79.4   2011-01-01 08:00:29
180.2.79.4   2011-01-01 08:00:50

If I understand you correctly, you want to count these like this.

180.2.79.3   3
180.2.79.4   3

You can do that for each ip_address by selecting the maximum timestamp that is both

  • greater than the current row's timestamp, and
  • less than or equal to 10 seconds greater than the current row's timestamp.

Taking these two criteria together will introduce some nulls, which turn out to be really useful.

select ip_address, 
       t_s.time_stamp, 
       (select max(t.time_stamp) 
        from t_s t 
        where t.ip_address = t_s.ip_address 
          and t.time_stamp > t_s.time_stamp
          and t.time_stamp - t_s.time_stamp <= interval '10' second) next_page
from t_s 
group by ip_address, t_s.time_stamp
order by ip_address, t_s.time_stamp;

ip_address   time_stamp            next_page
180.2.79.3   2011-01-01 08:00:00   2011-01-01 08:00:09
180.2.79.3   2011-01-01 08:00:09   <null>
180.2.79.3   2011-01-01 08:00:20   2011-01-01 08:00:25
180.2.79.3   2011-01-01 08:00:23   2011-01-01 08:00:25
180.2.79.3   2011-01-01 08:00:25   <null>
180.2.79.3   2011-01-01 08:00:40   <null>
180.2.79.4   2011-01-01 08:00:00   <null>
180.2.79.4   2011-01-01 08:00:13   2011-01-01 08:00:23
180.2.79.4   2011-01-01 08:00:23   2011-01-01 08:00:29
180.2.79.4   2011-01-01 08:00:25   2011-01-01 08:00:29
180.2.79.4   2011-01-01 08:00:27   2011-01-01 08:00:29
180.2.79.4   2011-01-01 08:00:29   <null>
180.2.79.4   2011-01-01 08:00:50   <null>

The timestamp that marks the end of a visit has a null for its own next_page. That's because no timestamp is less than or equal to time_stamp + 10 seconds for that row.

To get a count, I'd probably create a view and count the nulls.

select ip_address, count(*)
from t_s_visits 
where next_page is null
group by ip_address

180.2.79.3   3
180.2.79.4   3
like image 152
Mike Sherrill 'Cat Recall' Avatar answered Nov 18 '22 23:11

Mike Sherrill 'Cat Recall'


You could JOIN the tracking table to itself and filter out the records you don't need by adding a WHEREclause.

SELECT  t1.ip_address
        , COUNT(*) AS tracks
FROM    tracking t1
        LEFT OUTER JOIN tracking t2 ON t2.ip_address = t1.ip_address
                                       AND t2.time_stamp < t1.time_stamp + 10
WHERE   t2.ip_adress IS NULL
GROUP BY
        t1.ip_address

Edit

Following script works in SQL Server but I can't express it in a single SQL statement, let alone convert it to MySQL. It might give you some pointers on what is needed though.

Note: I assume for given inputs, number 1 and 11 should get chosen.

;WITH q (number) AS (
  SELECT 1
  UNION ALL SELECT 2
  UNION ALL SELECT 10
  UNION ALL SELECT 11  
  UNION ALL SELECT 12
)
SELECT  q1.Number as n1
        , q2.Number as n2
        , 0 as Done
INTO    #Temp
FROM    q q1
        LEFT OUTER JOIN q q2 ON q2.number < q1.number + 10
                                AND q2.number > q1.number

DECLARE @n1 INTEGER
DECLARE @n2 INTEGER

WHILE EXISTS (SELECT * FROM #Temp WHERE Done = 0)
BEGIN

  SELECT  TOP 1 @n1 = n1
          , @n2= n2
  FROM    #Temp
  WHERE   Done = 0

  DELETE  FROM #Temp
  WHERE   n1 = @n2

  UPDATE  #Temp 
  SET     Done = 1
  WHERE   n1 = @n1 
          AND n2 = @n2         
END        

SELECT  DISTINCT n1 
FROM    #Temp

DROP TABLE #Temp
like image 6
Lieven Keersmaekers Avatar answered Nov 19 '22 00:11

Lieven Keersmaekers


The simplest way to do this is to divide the timestamps by 10, and count the distinct combinations of those values and the ip_address values. That way each 10 second period is counted separately.

If you run this on your sample data it will give you 4 tracks, which is what you want I think.

Give it a try and see if it gives you the desired results on your full data set:

SELECT COUNT(DISTINCT ip_address, FLOOR(time_stamp/10)) AS tracks 
FROM tracking
like image 2
Ike Walker Avatar answered Nov 18 '22 23:11

Ike Walker