I have a (very simplified here) pandas dataframe which looks like this:
df
              datetime  user  type          msg
0  2012-11-11 15:41:08    u1   txt  hello world
1  2012-11-11 15:41:11    u2   txt  hello world
2  2012-11-21 17:00:08    u3   txt  hello world
3  2012-11-22 18:08:35    u4   txt    hello you
4  2012-11-22 18:08:37    u5   txt    hello you
What I would like to do now is to get all the duplicate messages whose timestamps are within 3 seconds of each other. The desired output would be:
              datetime  user  type          msg
0  2012-11-11 15:41:08    u1   txt  hello world
1  2012-11-11 15:41:11    u2   txt  hello world
3  2012-11-22 18:08:35    u4   txt    hello you
4  2012-11-22 18:08:37    u5   txt    hello you
without the third row, as its text is the same as in rows one and two, but its timestamp is not within 3 seconds of theirs.
I tried passing the columns datetime and msg as the subset for the duplicated()
method, but it returns an empty dataframe because the timestamps are not identical:
mask = df.duplicated(subset=['datetime', 'msg'], keep=False)
print(df[mask])
Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []
Is there a way to define a range for my "datetime" parameter? To illustrate, something like:
mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)
Any help here would as always be very much appreciated.
This piece of code gives the expected output:
df[df.groupby("msg")["datetime"].diff().fillna(pd.Timedelta(0)).dt.total_seconds() <= 3]
I grouped the dataframe on the "msg" column, selected the "datetime" column, and applied the built-in diff function, which computes the difference between consecutive values within each group. The NaT values (the first row of each group) are filled with zero, so the first occurrence of each message is always kept, and only the rows whose difference is at most 3 seconds are selected.
Before using the code above, make sure that your dataframe is sorted on datetime in ascending order.
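Because the first row of each group is always kept by the approach above, a message that occurs only once would also be returned. A minimal sketch of one way around that, assuming the sample data from the question: compare each row with both its previous and its next row in the same "msg" group, and keep it only if at least one of them is within the window. The helper name near_duplicates and its window parameter are hypothetical, introduced here for illustration only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world",
            "hello you", "hello you"],
})


def near_duplicates(frame, window="3s"):
    """Hypothetical helper: keep rows that have a same-msg neighbour within `window`."""
    limit = pd.Timedelta(window)
    ordered = frame.sort_values("datetime")   # diff() relies on time order
    grouped = ordered.groupby("msg")["datetime"]
    gap_prev = grouped.diff()                 # time since the previous same-msg row
    gap_next = grouped.diff(-1).abs()         # time until the next same-msg row
    # NaT compares as False, so rows without a neighbour on one side are
    # judged only by the neighbour on the other side
    mask = (gap_prev <= limit) | (gap_next <= limit)
    return ordered[mask].sort_index()


result = near_duplicates(df)
print(result)
```

On the sample data this keeps the rows with index 0, 1, 3 and 4. Note that it compares neighbouring rows only, so a long chain of messages each 3 seconds apart would be kept in full.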
This bit of code works on your example data, although you may need to adjust it for edge cases.
From your question, I'm assuming you want to keep only the messages that occur shortly after the first time a message appears in df. It won't work if you also want to keep a message when it appears again much later (for example, a second pair of duplicates long after the first).
In short, I wrote a function that takes your dataframe and the 'msg' to filter for. It takes the timestamp of the first time the message appears and compares it to all the other times it appears.
It then selects only the instances that fall within 3 seconds of the first appearance.
import pandas as pd

# Return the rows for one message that fall within three seconds of its first appearance
def get_info_within_3seconds(df, msg):
    df_of_msg = df[df['msg'] == msg].sort_values(by='datetime')
    # Timestamp of the first appearance of this message
    t1 = df_of_msg['datetime'].reset_index(drop=True)[0]
    datetime_deltas = [(i - t1).total_seconds() for i in df_of_msg['datetime']]
    filter_list = [i <= 3.0 for i in datetime_deltas]
    return df_of_msg[filter_list]

msgs = df['msg'].unique()
# Apply the function to each unique message and concatenate the results into a new df
new_df = pd.concat([get_info_within_3seconds(df, i) for i in msgs])
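The same "within 3 seconds of the first appearance" idea can probably be written without a Python-level loop. The following is only a sketch under that assumption, using the sample data from the question; the names first_seen and within_window are made up for this example.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world",
            "hello you", "hello you"],
})

# Earliest timestamp of each message, broadcast back onto every row
first_seen = df.groupby("msg")["datetime"].transform("min")

# True for rows at most 3 seconds after their message's first appearance
within_window = (df["datetime"] - first_seen) <= pd.Timedelta(seconds=3)

new_df = df[within_window]
print(new_df)
```

Unlike the pd.concat version, this keeps the rows in their original order rather than grouping them by message. As with the function above, a message that appears only once is still kept, because it is zero seconds away from its own first appearance.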