I have login history data from User A for a day. My requirement is that at any point in time the User A can have only one valid login. As in the samples below, the user may have attempted to login successfully multiple times, while his first session was still active. So, any logins that happened during the valid session needs to be flagged as duplicate.
Example 1:
In the first sample data below, while the user was still logged in from 00:12:38
to 01:00:02 (index 0)
, there is another login from the user at 00:55:14
to 01:00:02 (index 1)
.
Similarly, if we compare index 2
and 3
, we can see that the record at index 3
is duplicate login as per requirement.
start_time end_time
0 00:12:38 01:00:02
1 00:55:14 01:00:02
2 01:00:02 01:32:40
3 01:00:02 01:08:40
4 01:41:22 03:56:23
5 18:58:26 19:16:49
6 20:12:37 20:52:49
7 20:55:16 22:02:50
8 22:21:24 22:48:50
9 23:11:30 00:00:00
Expected output:
start_time end_time isDup
0 00:12:38 01:00:02 0
1 00:55:14 01:00:02 1
2 01:00:02 01:32:40 0
3 01:00:02 01:08:40 1
4 01:41:22 03:56:23 0
5 18:58:26 19:16:49 0
6 20:12:37 20:52:49 0
7 20:55:16 22:02:50 0
8 22:21:24 22:48:50 0
9 23:11:30 00:00:00 0
These duplicate records need to be updated to 1 at column isDup
.
Example 2:
Another sample of data as below. Here, while the user was still logged in between 13:36:10
and 13:50:16
, there were 3 additional sessions too that needs to be flagged.
start_time end_time
0 13:32:54 13:32:55
1 13:36:10 13:50:16
2 13:37:54 13:38:14
3 13:46:38 13:46:45
4 13:48:59 13:49:05
5 13:50:16 13:50:20
6 14:03:39 14:03:49
7 15:36:20 15:36:20
8 15:46:47 15:46:47
Expected output:
start_time end_time isDup
0 13:32:54 13:32:55 0
1 13:36:10 13:50:16 0
2 13:37:54 13:38:14 1
3 13:46:38 13:46:45 1
4 13:48:59 13:49:05 1
5 13:50:16 13:50:20 0
6 14:03:39 14:03:49 0
7 15:36:20 15:36:20 0
8 15:46:47 15:46:47 0
What's the efficient way to compare the start time of the current record with previous records?
Overview of SQL Server LAG() function In other words, by using the LAG() function, from the current row, you can access data of the previous row, or the row before the previous row, and so on. The LAG() function can be very useful for comparing the value of the current row with the value of the previous row.
LAG provides access to a row at a given physical offset that comes before the current row. Use this analytic function in a SELECT statement to compare values in the current row with values in a previous row.
Query duplicated()
and change astype to int
df['isDup']=(df['Start time'].duplicated(False)|df['End time'].duplicated(False)).astype(int)
Or did you need
df['isDup']=(df['Start time'].between(df['Start time'].shift(),df['End time'].shift())).astype(int)
Map the time
like values in columns start_time
and end_time
to pandas TimeDelta
objects and subtract 1 seconds
from the 00:00:00
timedelta values in end_time
column.
c = ['start_time', 'end_time']
s, e = df[c].astype(str).apply(pd.to_timedelta).to_numpy().T
e[e == pd.Timedelta(0)] += pd.Timedelta(days=1, seconds=-1)
Then for each pair of start_time
and end_time
in the dataframe df
mark the corresponding duplicate intervals using numpy broadcasting
:
m = (s[:, None] >= s) & (e[:, None] <= e)
np.fill_diagonal(m, False)
df['isDupe'] = (m.any(1) & ~df[c].duplicated(keep=False)).view('i1')
# example 1
start_time end_time isDupe
0 00:12:38 01:00:02 0
1 00:55:14 01:00:02 1
2 01:00:02 01:32:40 0
3 01:00:02 01:08:40 1
4 01:41:22 03:56:23 0
5 18:58:26 19:16:49 0
6 20:12:37 20:52:49 0
7 20:55:16 22:02:50 0
8 22:21:24 22:48:50 0
9 23:11:30 00:00:00 0
# example 2
start_time end_time isDupe
0 13:32:54 13:32:55 0
1 13:36:10 13:50:16 0
2 13:37:54 13:38:14 1
3 13:46:38 13:46:45 1
4 13:48:59 13:49:05 1
5 13:50:16 13:50:20 0
6 14:03:39 14:03:49 0
7 15:36:20 15:36:20 0
8 15:46:47 15:46:47 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With