I have a data set as follows:
Name | Time | App
---------------------------------
Mike 2019-05-10 21:10 chrome.exe
Mike 2019-05-10 21:10 chrome.exe
Mike 2019-05-10 21:12 chrome.exe
John 2019-05-10 18:09 chrome.exe
John 2019-05-10 18:25 chrome.exe
My goal: for each user, I want to combine rows for the same app that have the same timestamp or fall within a 5-minute interval, keeping only the earliest timestamp.
Expected output:
Name | Time | App
---------------------------------
Mike 2019-05-10 21:10 chrome.exe
John 2019-05-10 18:09 chrome.exe
John 2019-05-10 18:25 chrome.exe
Mike ran chrome.exe 3 times, but the interval was <= 5 minutes, so we want to count it as one run. John ran chrome.exe 2 times, but more than 5 minutes apart, so they count as separate runs. I have tried merge, merge_asof, and pd.Timedelta.
Let's first create an example data frame (a bit different from yours):
import pandas as pd
data = [('2019-01-01 13:00','John', 'Chrome'),('2019-01-01 13:02','John', 'Chrome'),('2019-01-01 13:06','John', 'Chrome'),('2019-01-01 13:00','Mike', 'Chrome'),('2019-01-01 13:02','Mike', 'Chrome'), ('2019-01-01 13:06','John', 'Chrome')]
df = pd.DataFrame(data, columns=['Time', 'Name', 'App'])
You'll need time as index in your dataframe. You can achieve this by:
df.index = pd.to_datetime(df['Time'])
Then you can do the following:
df.groupby(['Name', 'App', pd.Grouper(freq='5min')]).min()  # '5min' means 5 minutes; '5T' is the older alias, deprecated in recent pandas
Note that this groups by fixed five-minute intervals aligned to the full hour, so 13:04 and 13:06 count as two different occurrences as far as this solution is concerned. You can look up the other frequencies you can group by at http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
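To make the bin-edge caveat concrete, here is a small sketch. It uses floor('5min') as a stand-in for the binning that pd.Grouper does (that equivalence is my assumption for bins aligned to midnight), and the two timestamps are made up for illustration:

```python
import pandas as pd

# Hypothetical timestamps, only 2 minutes apart but on opposite sides of 13:05
times = pd.to_datetime(['2019-01-01 13:04', '2019-01-01 13:06'])

# floor('5min') returns the start of the fixed 5-minute bin each timestamp falls in
bins = times.floor('5min')
print(bins)
```

The two events land in the 13:00 and 13:05 bins, so they would not be merged even though they are less than 5 minutes apart.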
Result:
Name  App     Time                   Time
John  Chrome  2019-01-01 13:00:00    2019-01-01 13:00
              2019-01-01 13:05:00    2019-01-01 13:06
Mike  Chrome  2019-01-01 13:00:00    2019-01-01 13:00
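The second time on each row is the one you are interested in: the earliest Time within that bin. The result carries Name, App and the bin start as a multi-level index, so you'll likely want to flatten it into a regular data frame, or apply .unstack(level=1).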
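If you want a flat data frame straight away, one possible sketch is below. It swaps pd.Grouper for an explicit helper column (the name Bin is my own, not something from the question) built with dt.floor, and it should give the same fixed bins under the same alignment caveat:

```python
import pandas as pd

data = [('2019-01-01 13:00', 'John', 'Chrome'), ('2019-01-01 13:02', 'John', 'Chrome'),
        ('2019-01-01 13:06', 'John', 'Chrome'), ('2019-01-01 13:00', 'Mike', 'Chrome'),
        ('2019-01-01 13:02', 'Mike', 'Chrome'), ('2019-01-01 13:06', 'John', 'Chrome')]
df = pd.DataFrame(data, columns=['Time', 'Name', 'App'])
df['Time'] = pd.to_datetime(df['Time'])

# 'Bin' is a hypothetical helper column holding the start of each fixed 5-minute bin
df['Bin'] = df['Time'].dt.floor('5min')

# One row per (Name, App, Bin), keeping the earliest timestamp seen in that bin
earliest = df.groupby(['Name', 'App', 'Bin'], as_index=False)['Time'].min()
print(earliest[['Name', 'Time', 'App']])
```

Because as_index=False keeps the keys as ordinary columns, no unstacking or index reset is needed afterwards.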
As was pointed out in the comments, you may not want Time both as an index and as a column. In that case, instead of df.index = pd.to_datetime(df['Time']) you might convert the column first and then move it into the index:
df['Time'] = pd.to_datetime(df['Time'])
df.set_index('Time', inplace=True)
You may try this:
df['Time'] = pd.to_datetime(df['Time'])
print(df)
m = df.groupby(['Name', 'App'])['Time'].diff() <= pd.Timedelta(minutes=5)  # True for rows within 5 minutes of the previous row of the same user and app
df2 = df[~m]
print(df2)
What is m? It is the mask of the duplicates we want to remove. Instead of removing them in place, I assigned a new data frame without these rows. This assumes the rows of each user and app are already in time order.
Output of print(df):
   Name                Time         App
0  Mike 2019-05-10 21:10:00  chrome.exe
1  Mike 2019-05-10 21:10:00  chrome.exe
2  Mike 2019-05-10 21:12:00  chrome.exe
3  John 2019-05-10 18:09:00  chrome.exe
4  John 2019-05-10 18:25:00  chrome.exe

Output of print(df2):
   Name                Time         App
0  Mike 2019-05-10 21:10:00  chrome.exe
3  John 2019-05-10 18:09:00  chrome.exe
4  John 2019-05-10 18:25:00  chrome.exe
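If you also want to know how many events were folded into each run, a variation on the same idea is to number the runs and aggregate. This is only a sketch: the Run and Events column names are my own, and it treats a row as part of the current run whenever it is within 5 minutes of the previous row, so a long chain of closely spaced events becomes one run even if it spans more than 5 minutes in total.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Mike', 'Mike', 'Mike', 'John', 'John'],
    'Time': pd.to_datetime(['2019-05-10 21:10', '2019-05-10 21:10', '2019-05-10 21:12',
                            '2019-05-10 18:09', '2019-05-10 18:25']),
    'App': ['chrome.exe'] * 5,
})

# diff() compares each row with the previous one, so order by time within each group
df = df.sort_values(['Name', 'App', 'Time'])
gap = df.groupby(['Name', 'App'])['Time'].diff()

# A new run starts on the first row of a group (gap is NaT) or after a gap of more than 5 minutes
new_run = gap.isna() | (gap > pd.Timedelta(minutes=5))
df['Run'] = new_run.cumsum()  # hypothetical helper column: running run number

# Earliest timestamp and number of events per run
runs = df.groupby(['Name', 'App', 'Run'], as_index=False).agg(
    Time=('Time', 'min'),
    Events=('Time', 'count'),
)
print(runs[['Name', 'Time', 'App', 'Events']])
```

On the question's data this should give one run for Mike with three events and two separate single-event runs for John.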