I have two dataframes: let's call them group_user_log and group_train.
group_user_log:

user_id  server_time  session_id
1        2018-01-01   435
1        2018-01-01   435
1        2018-01-04   675
1        2018-01-05   454
1        2018-01-05   454
1        2018-01-06   920
group_train:

user_id  impression_time  totalcount  distinct_count
1        2018-01-03       0           0
1        2018-01-05       0           0
The logic is: for each row of group_train, take the total count and distinct count of session_id from group_user_log where server_time is less than impression_time, and use those values to populate the totalcount and distinct_count columns. The expected output for group_train is:
user_id  impression_time  totalcount  distinct_count
1        2018-01-03       2           1
1        2018-01-05       3           2
I tried doing this row by row, but that is very slow for larger dataframes. The data above is a subset for a single user_id taken from two large dataframes, and the same calculation needs to be done for a large number of user_ids, so I am looking for an efficient approach.
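For reference, the row-by-row version I was replacing looks roughly like this (a sketch using the sample data above; it scans the whole log once per impression row, which is what makes it slow):

```python
import pandas as pd

group_user_log = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1, 1],
    'server_time': pd.to_datetime(['2018-01-01', '2018-01-01', '2018-01-04',
                                   '2018-01-05', '2018-01-05', '2018-01-06']),
    'session_id': [435, 435, 675, 454, 454, 920],
})
group_train = pd.DataFrame({
    'user_id': [1, 1],
    'impression_time': pd.to_datetime(['2018-01-03', '2018-01-05']),
    'totalcount': [0, 0],
    'distinct_count': [0, 0],
})

# Slow: one full filtering pass over the log for every impression row
for i, row in group_train.iterrows():
    mask = ((group_user_log['user_id'] == row['user_id'])
            & (group_user_log['server_time'] < row['impression_time']))
    sessions = group_user_log.loc[mask, 'session_id']
    group_train.at[i, 'totalcount'] = len(sessions)
    group_train.at[i, 'distinct_count'] = sessions.nunique()

print(group_train)
```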
Thanks for your help!!
With groupby, merge and query:
# merge on user_id and keep rows where server_time < impression_time
m = group_user_log.merge(group_train, on='user_id').query('server_time < impression_time')

# group by user_id and impression_time, aggregate with size and nunique
m = (m.groupby(['user_id', 'impression_time'])['session_id']
       .agg(['size', 'nunique'])
       .rename(columns={'size': 'totalcount', 'nunique': 'distinct_count'}))
totalcount distinct_count
user_id impression_time
1 2018-01-03 2 1
2018-01-05 3 2
You can then use this result to update group_train after setting user_id and impression_time as its index:
group_train=group_train.set_index(['user_id','impression_time'])
group_train.update(m)
print(group_train) #.reset_index()
totalcount distinct_count
user_id impression_time
1 2018-01-03 2 1
2018-01-05 3 2
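Putting the whole pipeline together as a runnable sketch (sample frames reconstructed from the question; note that `DataFrame.update` aligns on the MultiIndex, so only matching (user_id, impression_time) rows are overwritten):

```python
import pandas as pd

group_user_log = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1, 1],
    'server_time': pd.to_datetime(['2018-01-01', '2018-01-01', '2018-01-04',
                                   '2018-01-05', '2018-01-05', '2018-01-06']),
    'session_id': [435, 435, 675, 454, 454, 920],
})
group_train = pd.DataFrame({
    'user_id': [1, 1],
    'impression_time': pd.to_datetime(['2018-01-03', '2018-01-05']),
    'totalcount': [0, 0],
    'distinct_count': [0, 0],
})

# Merge on user_id, keep log rows before the impression, then aggregate:
# 'size' counts all sessions, 'nunique' counts distinct session_ids
m = (group_user_log.merge(group_train, on='user_id')
     .query('server_time < impression_time')
     .groupby(['user_id', 'impression_time'])['session_id']
     .agg(['size', 'nunique'])
     .rename(columns={'size': 'totalcount', 'nunique': 'distinct_count'}))

# Align on the same MultiIndex and overwrite the zero placeholders
group_train = group_train.set_index(['user_id', 'impression_time'])
group_train.update(m)
group_train = group_train.reset_index()
print(group_train)
```

One pass over a single merged frame replaces the per-row loop, so the cost no longer grows with (rows of group_train) × (rows of group_user_log) filtering passes.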