Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Grouping Timestamps based on the interval between them

I have a table in Hive (SQL) with a bunch of timestamps that need to be grouped in order to create separate sessions based on the time difference between the timestamps.

Example: Consider the following timestamps(Given in HH:MM for simplicity): 9.00 9.10 9.20 9.40 9.43 10.30 10.45 11.25 12.30 12.33 and so on..

So now, all timestamps that fall within 30 mins of the next timestamp come under the same session, i.e. 9.00,9.10,9.20,9.40,9.43 form 1 session.

But since the difference between 9.43 and 10.30 is more than 30 mins, the time stamp 10.30 falls under a different session. Again, 10.30 and 10.45 fall under one session.

After we have created these sessions, we have to obtain the minimum timestamp for that session and the max timestamp.

I tried to subtract the current timestamp with its LEAD and place a flag if it is greater than 30 mins, but I'm having difficulty with this.

Any suggestion from you guys would be greatly appreciated. Please let me know if the question isn't clear enough.

Expected Output for this sample data:

Session_start   Session_end
9.00                9.43
10.30               10.45
11.25               11.25 (same because the next time is not within 30 mins)
12.30               12.33

Hope this helps.

like image 712
FenderBender Avatar asked Dec 14 '22 17:12

FenderBender


1 Answers

So it's not MySQL but Hive. I don't know Hive, but if it supports LAG, as you say, try this PostgreSQL query. You will probably have to change the time difference calculation, that's usually different from one dbms to another.

select min(thetime) as start_time, max(thetime) as end_time
from
(
  select thetime, count(gap) over (rows between unbounded preceding and current row) as groupid
  from
  (
    select thetime, case when thetime - lag(thetime) over (order by thetime) > interval '30 minutes' then 1 end as gap
    from mytable
  ) times
) groups
group by groupid
order by min(thetime);

The query finds gaps, then uses a running total of gap counts to build group IDs, and the rest is aggregation.

SQL fiddle: http://www.sqlfiddle.com/#!17/8bc4a/6.

like image 132
Thorsten Kettner Avatar answered Jan 12 '23 04:01

Thorsten Kettner