I track web visitors. I store the IP address as well as the timestamp of the visit.
ip_address time_stamp
180.2.79.3 1301654105
180.2.79.3 1301654106
180.2.79.3 1301654354
180.2.79.3 1301654356
180.2.79.3 1301654358
180.2.79.3 1301654366
180.2.79.3 1301654368
180.2.79.3 1301654422
I have a query to get total tracks:
SELECT COUNT(*) AS tracks FROM tracking
However, I now want to disregard visits from users that have visited multiple times within 10 seconds of each visit. Since I don't consider this another visit, its still part of the first visit.
When the ip_address is the same, check timestamp and only count those rows that are 10 seconds away from each other.
I am having difficulty in putting this into a SQL query form, I would appreciate any help on this!
Let me start with this table. I'll use ordinary timestamps so we can easily see what's going on.
180.2.79.3 2011-01-01 08:00:00
180.2.79.3 2011-01-01 08:00:09
180.2.79.3 2011-01-01 08:00:20
180.2.79.3 2011-01-01 08:00:23
180.2.79.3 2011-01-01 08:00:25
180.2.79.3 2011-01-01 08:00:40
180.2.79.4 2011-01-01 08:00:00
180.2.79.4 2011-01-01 08:00:13
180.2.79.4 2011-01-01 08:00:23
180.2.79.4 2011-01-01 08:00:25
180.2.79.4 2011-01-01 08:00:27
180.2.79.4 2011-01-01 08:00:29
180.2.79.4 2011-01-01 08:00:50
If I understand you correctly, you want to count these like this.
180.2.79.3 3
180.2.79.4 3
You can do that for each ip_address by selecting the maximum timestamp that is both
Taking these two criteria together will introduce some nulls, which turn out to be really useful.
select ip_address,
t_s.time_stamp,
(select max(t.time_stamp)
from t_s t
where t.ip_address = t_s.ip_address
and t.time_stamp > t_s.time_stamp
and t.time_stamp - t_s.time_stamp <= interval '10' second) next_page
from t_s
group by ip_address, t_s.time_stamp
order by ip_address, t_s.time_stamp;
ip_address time_stamp next_page
180.2.79.3 2011-01-01 08:00:00 2011-01-01 08:00:09
180.2.79.3 2011-01-01 08:00:09 <null>
180.2.79.3 2011-01-01 08:00:20 2011-01-01 08:00:25
180.2.79.3 2011-01-01 08:00:23 2011-01-01 08:00:25
180.2.79.3 2011-01-01 08:00:25 <null>
180.2.79.3 2011-01-01 08:00:40 <null>
180.2.79.4 2011-01-01 08:00:00 <null>
180.2.79.4 2011-01-01 08:00:13 2011-01-01 08:00:23
180.2.79.4 2011-01-01 08:00:23 2011-01-01 08:00:29
180.2.79.4 2011-01-01 08:00:25 2011-01-01 08:00:29
180.2.79.4 2011-01-01 08:00:27 2011-01-01 08:00:29
180.2.79.4 2011-01-01 08:00:29 <null>
180.2.79.4 2011-01-01 08:00:50 <null>
The timestamp that marks the end of a visit has a null for its own next_page. That's because no timestamp is less than or equal to time_stamp + 10 seconds for that row.
To get a count, I'd probably create a view and count the nulls.
select ip_address, count(*)
from t_s_visits
where next_page is null
group by ip_address
180.2.79.3 3
180.2.79.4 3
You could JOIN
the tracking table to itself and filter out the records you don't need by adding a WHERE
clause.
SELECT t1.ip_address
, COUNT(*) AS tracks
FROM tracking t1
LEFT OUTER JOIN tracking t2 ON t2.ip_address = t1.ip_address
AND t2.time_stamp < t1.time_stamp + 10
WHERE t2.ip_adress IS NULL
GROUP BY
t1.ip_address
Edit
Following script works in SQL Server but I can't express it in a single SQL statement, let alone convert it to MySQL. It might give you some pointers on what is needed though.
Note: I assume for given inputs, number 1 and 11 should get chosen.
;WITH q (number) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 10
UNION ALL SELECT 11
UNION ALL SELECT 12
)
SELECT q1.Number as n1
, q2.Number as n2
, 0 as Done
INTO #Temp
FROM q q1
LEFT OUTER JOIN q q2 ON q2.number < q1.number + 10
AND q2.number > q1.number
DECLARE @n1 INTEGER
DECLARE @n2 INTEGER
WHILE EXISTS (SELECT * FROM #Temp WHERE Done = 0)
BEGIN
SELECT TOP 1 @n1 = n1
, @n2= n2
FROM #Temp
WHERE Done = 0
DELETE FROM #Temp
WHERE n1 = @n2
UPDATE #Temp
SET Done = 1
WHERE n1 = @n1
AND n2 = @n2
END
SELECT DISTINCT n1
FROM #Temp
DROP TABLE #Temp
The simplest way to do this is to divide the timestamps by 10, and count the distinct combinations of those values and the ip_address values. That way each 10 second period is counted separately.
If you run this on your sample data it will give you 4 tracks, which is what you want I think.
Give it a try and see if it gives you the desired results on your full data set:
SELECT COUNT(DISTINCT ip_address, FLOOR(time_stamp/10)) AS tracks
FROM tracking
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With