Compare current row value to previous row values

Tags: python, pandas, dataframe

I have login history data for User A for a single day. My requirement is that, at any point in time, User A can have only one valid login. As in the samples below, the user may have logged in successfully multiple times while the first session was still active. Any login that happened during a valid session needs to be flagged as a duplicate.

Example 1:

In the first sample below, while the user was still logged in from 00:12:38 to 01:00:02 (index 0), there is another login from 00:55:14 to 01:00:02 (index 1).

Similarly, comparing index 2 and index 3, the record at index 3 is a duplicate login as per the requirement.

  start_time  end_time
0   00:12:38  01:00:02
1   00:55:14  01:00:02
2   01:00:02  01:32:40
3   01:00:02  01:08:40
4   01:41:22  03:56:23
5   18:58:26  19:16:49
6   20:12:37  20:52:49
7   20:55:16  22:02:50
8   22:21:24  22:48:50
9   23:11:30  00:00:00

Expected output:

  start_time  end_time   isDup
0   00:12:38  01:00:02       0
1   00:55:14  01:00:02       1
2   01:00:02  01:32:40       0
3   01:00:02  01:08:40       1
4   01:41:22  03:56:23       0
5   18:58:26  19:16:49       0
6   20:12:37  20:52:49       0
7   20:55:16  22:02:50       0
8   22:21:24  22:48:50       0
9   23:11:30  00:00:00       0

These duplicate records should be marked with 1 in the isDup column.

Example 2:

Another sample is shown below. Here, while the user was still logged in between 13:36:10 and 13:50:16, there were 3 additional sessions that also need to be flagged.

  start_time  end_time
0   13:32:54  13:32:55
1   13:36:10  13:50:16
2   13:37:54  13:38:14
3   13:46:38  13:46:45
4   13:48:59  13:49:05
5   13:50:16  13:50:20
6   14:03:39  14:03:49
7   15:36:20  15:36:20
8   15:46:47  15:46:47

Expected output:

  start_time    end_time    isDup
0   13:32:54    13:32:55    0
1   13:36:10    13:50:16    0
2   13:37:54    13:38:14    1
3   13:46:38    13:46:45    1
4   13:48:59    13:49:05    1
5   13:50:16    13:50:20    0
6   14:03:39    14:03:49    0
7   15:36:20    15:36:20    0
8   15:46:47    15:46:47    0

What is an efficient way to compare the start time of the current record with the previous records?

asked Sep 11 '20 04:09

Saranya Krishnamurthy

2 Answers

Use duplicated() and cast the result to int with astype:

df['isDup'] = (df['start_time'].duplicated(keep=False) | df['end_time'].duplicated(keep=False)).astype(int)

Or did you need

df['isDup'] = df['start_time'].between(df['start_time'].shift(), df['end_time'].shift()).astype(int)
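The shift() comparison above looks only at the immediately preceding row. A minimal sketch of one way to compare each start time against all earlier rows is to track the running maximum of the earlier end times. It assumes the rows are already sorted by start_time, that an end_time of 00:00:00 means midnight at the end of the day, and that a login counts as a duplicate when it starts before an earlier session has ended. The conversion to seconds and the name prev_end are choices made for this illustration, not part of the original answer.

```python
import pandas as pd

# Example 2 from the question
df = pd.DataFrame({
    "start_time": ["13:32:54", "13:36:10", "13:37:54", "13:46:38", "13:48:59",
                   "13:50:16", "14:03:39", "15:36:20", "15:46:47"],
    "end_time":   ["13:32:55", "13:50:16", "13:38:14", "13:46:45", "13:49:05",
                   "13:50:20", "14:03:49", "15:36:20", "15:46:47"],
})

# Convert HH:MM:SS strings to seconds since midnight
start = pd.to_timedelta(df["start_time"]).dt.total_seconds()
end = pd.to_timedelta(df["end_time"]).dt.total_seconds()

# Assumption: an end_time of 00:00:00 means midnight at the end of the day
end = end.mask(end == 0, 24 * 60 * 60)

# Latest end time among all earlier rows (NaN for the first row)
prev_end = end.cummax().shift()

# A login is a duplicate if it starts before an earlier session has ended;
# comparisons against NaN are False, so the first row is never flagged
df["isDup"] = (start < prev_end).astype(int)

print(df["isDup"].tolist())  # [0, 0, 1, 1, 1, 0, 0, 0, 0]
```

Applied to Example 1, the same steps flag only index 1 and index 3. Because cummax and shift each make a single pass over the column, this avoids comparing every pair of rows.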

answered Nov 15 '22 04:11

wwnde

Convert the time-like values in the start_time and end_time columns to pandas Timedelta objects, and treat an end_time of 00:00:00 as the end of the day by adding one day minus one second to it (23:59:59).

import numpy as np
import pandas as pd

c = ['start_time', 'end_time']
s, e = df[c].astype(str).apply(pd.to_timedelta).to_numpy().T
# 00:00:00 as an end time means the end of the day, so move it to 23:59:59
e[e == pd.Timedelta(0)] += pd.Timedelta(days=1, seconds=-1)

Then, for each pair of start_time and end_time in the dataframe df, mark the intervals that lie inside another interval using numpy broadcasting:

# m[i, j] is True when interval i lies inside interval j
m = (s[:, None] >= s) & (e[:, None] <= e)
np.fill_diagonal(m, False)
# Rows that are exact copies of each other are not flagged
df['isDupe'] = (m.any(axis=1) & ~df[c].duplicated(keep=False)).astype(int)

# example 1
  start_time  end_time  isDupe
0   00:12:38  01:00:02       0
1   00:55:14  01:00:02       1
2   01:00:02  01:32:40       0
3   01:00:02  01:08:40       1
4   01:41:22  03:56:23       0
5   18:58:26  19:16:49       0
6   20:12:37  20:52:49       0
7   20:55:16  22:02:50       0
8   22:21:24  22:48:50       0
9   23:11:30  00:00:00       0

# example 2
  start_time  end_time  isDupe
0   13:32:54  13:32:55       0
1   13:36:10  13:50:16       0
2   13:37:54  13:38:14       1
3   13:46:38  13:46:45       1
4   13:48:59  13:49:05       1
5   13:50:16  13:50:20       0
6   14:03:39  14:03:49       0
7   15:36:20  15:36:20       0
8   15:46:47  15:46:47       0
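To see how the broadcast comparison in this answer builds its pairwise mask, here is a toy sketch on three intervals. The values are hypothetical (plain integers standing in for seconds) and are not taken from the question's data.

```python
import numpy as np

# Three toy intervals (hypothetical values): [0, 100], [10, 20], [50, 60]
s = np.array([0, 10, 50])
e = np.array([100, 20, 60])

# s[:, None] has shape (3, 1); comparing it with s of shape (3,)
# broadcasts to a (3, 3) grid covering every pair (i, j).
# m[i, j] is True when interval i lies inside interval j.
m = (s[:, None] >= s) & (e[:, None] <= e)

# Every interval trivially contains itself, so clear the diagonal
np.fill_diagonal(m, False)

# An interval is flagged if it lies inside at least one other interval
contained = m.any(axis=1)
print(contained.tolist())  # [False, True, True]
```

The mask has one entry per pair of rows, so its memory use grows with the square of the number of rows.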

answered Nov 15 '22 04:11

Shubham Sharma
