Python pandas: getting session start and end time to calculate session length

I have the data frame below, which is sorted by user and timestamp (the timestamp is written as an integer here to keep things simple).

I've added a column that gives the time difference from the previous activity in minutes, using pandas diff(). I define actions as belonging to the same session if they happen within 30 minutes of each other. Finding new sessions is then easy: I just check whether timediff is 'NaT' or greater than 30.

import pandas as pd

d = {'id': [123, 123, 123, 123, 123, 123, 234, 234],
     'activity': ['view', 'click', 'click', 'view', 'click', 'view', 'click', 'view'],
     'timestamp': [1, 2, 3, 4, 5, 6, 1, 2],
     'timediff_min': ['NaT', 1, 36, 2, 6, 124, 'NaT', 1],
     'new_session': [1, 0, 1, 0, 0, 1, 1, 0]}

df = pd.DataFrame(d)
df
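
For reference, the flagging rule described above could be sketched roughly as follows. The datetime values are hypothetical (they are not in the original data) and were chosen only so that the gaps match the timediff_min column; the 30-minute threshold is the one stated in the question.

```python
import pandas as pd

# Hypothetical timestamps: minute offsets picked so the per-user gaps
# reproduce the timediff_min values (1, 36, 2, 6, 124 and 1)
offsets = pd.to_timedelta([0, 1, 37, 39, 45, 169, 0, 1], unit='min')
df = pd.DataFrame({
    'id': [123, 123, 123, 123, 123, 123, 234, 234],
    'timestamp': pd.Timestamp('2025-01-01 09:00') + offsets,
})

# Gap to the previous activity of the same user, in minutes
# (missing on each user's first row)
df['timediff_min'] = df.groupby('id')['timestamp'].diff().dt.total_seconds() / 60

# A new session starts on a user's first row or after a gap of more than 30 minutes
df['new_session'] = (df['timediff_min'].isna() | (df['timediff_min'] > 30)).astype(int)
```

Grouping by id before calling diff() keeps the first row of one user from being compared with the last row of the previous user.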

This yields the 'new_session' column. I can now filter down to a dataframe with the timestamps of session starts, but I would also like the timestamp of the final activity in each session so that I can calculate session length. If a session has a single activity, session start and session end will be the same; if it has more than one, session start is the first activity and session end is the final activity before the next session starts. The final output would be something like this:

d2 = {'id': [123, 123, 123, 234],
      'activity': ['view', 'click', 'view', 'click'],
      'timestamp': [1, 3, 6, 1],
      'timediff_min': ['NaT', 36, 124, 'NaT'],
      'new_session': [1, 1, 1, 1],
      'session_start': [1, 3, 6, 1],
      'session_end': [2, 5, 6, 2]}
pd.DataFrame(d2)

Any help would be appreciated. Thanks!

L Xandor asked Dec 20 '25 21:12


1 Answer

I solved this with the following approach:

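import numpy as np
import pandas as pd
# Here d is the DataFrame (sorted by id and timestamp), with timestamp as a datetime column
d['time_diff'] = d.groupby('id')['timestamp'].diff()
# A session starts on a user's first row (NaT) or after a gap of more than 30 minutes
is_new = d['time_diff'].isnull() | (d['time_diff'] > pd.Timedelta(minutes=30))
d['new_sess'] = np.where(is_new, 'yes', 'no')
new_sessions = np.where(is_new)
# Label each session start with its row position, then forward-fill across the session
d['sess_count'] = np.nan
d.iloc[new_sessions[0], d.columns.get_loc('sess_count')] = new_sessions[0]
d['sess_count'] = d['sess_count'].ffill()
d['sess_id'] = d['id'].astype(str) + '-' + d['sess_count'].astype(int).astype(str)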

This creates unique session IDs that I can then group by to get the min and max timestamps.
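
The final grouping step might look something like the sketch below. It is a minimal, self-contained example and not the exact code from the answer: it numbers sessions with a cumulative sum of the new-session flag rather than with row positions, and the 'minute' column (minutes since an arbitrary reference time) and the 'session_length' column are hypothetical names used for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [123, 123, 123, 123, 123, 123, 234, 234],
    'activity': ['view', 'click', 'click', 'view', 'click', 'view', 'click', 'view'],
    # Hypothetical data: minutes since an arbitrary reference time
    'minute': [0, 1, 37, 39, 45, 169, 0, 1],
})

# Flag session starts: first row per user, or a gap of more than 30 minutes
gap = df.groupby('id')['minute'].diff()
new_session = gap.isna() | (gap > 30)

# A running count of session starts gives every row a session number
df['sess_id'] = df['id'].astype(str) + '-' + new_session.cumsum().astype(str)

# One row per session, with its first and last activity time
sessions = (
    df.groupby(['id', 'sess_id'], sort=False)
      .agg(session_start=('minute', 'min'), session_end=('minute', 'max'))
      .reset_index()
)
sessions['session_length'] = sessions['session_end'] - sessions['session_start']
```

A session with a single activity ends up with identical start and end values, so its length is 0, which matches the behaviour asked for in the question.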

L Xandor answered Dec 23 '25 10:12