I have a table in Hive (SQL) with a bunch of timestamps that need to be grouped in order to create separate sessions based on the time difference between the timestamps.
Example:
Consider the following timestamps(Given in HH:MM for simplicity):
9.00
9.10
9.20
9.40
9.43
10.30
10.45
11.25
12.30
12.33
and so on..
So now, all timestamps that fall within 30 mins of the next timestamp come under the same session, i.e. 9.00,9.10,9.20,9.40,9.43 form 1 session.
But since the difference between 9.43 and 10.30 is more than 30 mins, the time stamp 10.30 falls under a different session. Again, 10.30 and 10.45 fall under one session.
After we have created these sessions, we have to obtain the minimum timestamp for that session and the max timestamp.
I tried to subtract the current timestamp with its LEAD and place a flag if it is greater than 30 mins, but I'm having difficulty with this.
Any suggestion from you guys would be greatly appreciated. Please let me know if the question isn't clear enough.
Expected Output for this sample data:
Session_start Session_end
9.00 9.43
10.30 10.45
11.25 11.25 (same because the next time is not within 30 mins)
12.30 12.33
Hope this helps.
So it's not MySQL but Hive. I don't know Hive, but if it supports LAG, as you say, try this PostgreSQL query. You will probably have to change the time difference calculation, that's usually different from one dbms to another.
select min(thetime) as start_time, max(thetime) as end_time
from
(
select thetime, count(gap) over (rows between unbounded preceding and current row) as groupid
from
(
select thetime, case when thetime - lag(thetime) over (order by thetime) > interval '30 minutes' then 1 end as gap
from mytable
) times
) groups
group by groupid
order by min(thetime);
The query finds gaps, then uses a running total of gap counts to build group IDs, and the rest is aggregation.
SQL fiddle: http://www.sqlfiddle.com/#!17/8bc4a/6.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With