
How to remove same values based on time using interval?

I have a data set as following:

Name |        Time     |   App 
---------------------------------
Mike  2019-05-10 21:10 chrome.exe
Mike  2019-05-10 21:10 chrome.exe
Mike  2019-05-10 21:12 chrome.exe
John  2019-05-10 18:09 chrome.exe
John  2019-05-10 18:25 chrome.exe

My goal: for each user, combine rows for the same app that share a timestamp or fall within a 5-minute interval of each other, keeping only the earliest timestamp.

Expected output:

Name |        Time     |   App 
---------------------------------
Mike  2019-05-10 21:10 chrome.exe
John  2019-05-10 18:09 chrome.exe
John  2019-05-10 18:25 chrome.exe

Mike ran chrome.exe 3 times, but the interval was <= 5 minutes, so we want to count it as one run. John ran chrome.exe 2 times, but more than 5 minutes apart, so they count as separate runs. I have tried merge, merge_asof, and pd.Timedelta.

MichaelMMeskhi asked May 10 '19 22:05


2 Answers

Let's first create an example data frame (a bit different from yours):

import pandas as pd
data = [('2019-01-01 13:00', 'John', 'Chrome'), ('2019-01-01 13:02', 'John', 'Chrome'), ('2019-01-01 13:06', 'John', 'Chrome'),
        ('2019-01-01 13:00', 'Mike', 'Chrome'), ('2019-01-01 13:02', 'Mike', 'Chrome'), ('2019-01-01 13:06', 'John', 'Chrome')]
df = pd.DataFrame(data, columns=['Time', 'Name', 'App'])

You'll need time as index in your dataframe. You can achieve this by:

df.index = pd.to_datetime(df['Time'])

Then you can do the following:

df.groupby(['Name', 'App', pd.Grouper(freq='5min')]).min()  # '5min' means 5-minute bins ('5T' is the older alias, deprecated in recent pandas)

(Note that this groups by fixed five-minute intervals aligned to the full hour, so 13:04 and 13:06 fall into different bins and count as two separate occurrences as far as this solution is concerned.) You can look up other frequencies to group by at http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
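
To see the bin-edge effect concretely, here is a minimal sketch. The three timestamps are made up for illustration (they are not from the question), and the series of ones is only there so that the events can be counted per bin.

```python
import pandas as pd

# Hypothetical timestamps chosen to straddle a bin edge at 13:05.
times = pd.to_datetime(['2019-01-01 13:00', '2019-01-01 13:04', '2019-01-01 13:06'])
events = pd.Series([1, 1, 1], index=times)

# Count events per fixed 5-minute bin; the bins start at 13:00 and 13:05.
counts = events.groupby(pd.Grouper(freq='5min')).sum()
print(counts.tolist())  # [2, 1]
```

13:04 lands in the 13:00 bin and 13:06 in the 13:05 bin, even though the two events are only 2 minutes apart.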

Result:

Name  App     Time               
John  Chrome  2019-01-01 13:00:00    2019-01-01 13:00
              2019-01-01 13:05:00    2019-01-01 13:06
Mike  Chrome  2019-01-01 13:00:00    2019-01-01 13:00

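The second time is the one you are interested in: the earliest timestamp in each bin. With the example above, the result is a dataframe with a three-level index (Name, App, bin start) and a single Time column; you'll likely want to flatten it into regular columns, or apply .unstack(level=1).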

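As was pointed out in the comments, you may not want Time both as the index and as a column. In that case, instead of df.index = pd.to_datetime(df['Time']) you might do the following (converting first, since pd.Grouper(freq=...) needs a datetime index). Note that the grouped result then carries only the bin start, not the earliest timestamp inside each bin.

df['Time'] = pd.to_datetime(df['Time'])
df.set_index('Time', inplace=True)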
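
If the goal is a flat dataframe shaped like the expected output in the question, one possible sketch is below. It swaps pd.Grouper for an explicit helper column built with dt.floor, which should give the same fixed 5-minute bins without touching the index. The column name Bin is a hypothetical name introduced here, and the rows are the question's sample data.

```python
import pandas as pd

rows = [
    ('Mike', '2019-05-10 21:10', 'chrome.exe'),
    ('Mike', '2019-05-10 21:10', 'chrome.exe'),
    ('Mike', '2019-05-10 21:12', 'chrome.exe'),
    ('John', '2019-05-10 18:09', 'chrome.exe'),
    ('John', '2019-05-10 18:25', 'chrome.exe'),
]
df = pd.DataFrame(rows, columns=['Name', 'Time', 'App'])
df['Time'] = pd.to_datetime(df['Time'])

# Helper column (hypothetical name): start of the fixed 5-minute bin.
df['Bin'] = df['Time'].dt.floor('5min')

# Earliest real timestamp per (Name, App, Bin), kept as ordinary columns.
result = (
    df.groupby(['Name', 'App', 'Bin'], as_index=False, sort=False)['Time']
      .min()
      .drop(columns='Bin')[['Name', 'Time', 'App']]
)
print(result)
```

The same caveat about bin edges applies: two runs at 21:14 and 21:16 would still be kept as separate rows.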
lisu answered Nov 08 '22 11:11


You may try this:

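df['Time'] = pd.to_datetime(df['Time'])
print(df)
# Mask: True where the gap to the previous row of the same Name/App is at most 5 minutes.
# Assumes rows are already sorted by Time within each group.
m = df.groupby(['Name', 'App'])['Time'].diff() <= pd.Timedelta(minutes=5)
df2 = df[~m]
print(df2)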

What is m? It is a mask marking the duplicates we want to remove. Instead of removing them in place, I assign the remaining rows to a new dataframe.


   Name                Time          App
0  Mike 2019-05-10 21:10:00   chrome.exe
1  Mike 2019-05-10 21:10:00   chrome.exe
2  Mike 2019-05-10 21:12:00   chrome.exe
3  John 2019-05-10 18:09:00   chrome.exe
4  John 2019-05-10 18:25:00  chrome.exe7
   Name                Time          App
0  Mike 2019-05-10 21:10:00   chrome.exe
3  John 2019-05-10 18:09:00   chrome.exe
4  John 2019-05-10 18:25:00  chrome.exe7
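
One property of a row-to-row gap mask like m is worth illustrating: each row is compared with the row just before it, not with the earliest kept row. The sketch below uses a made-up chain of three events 4 minutes apart (not from the question) and sorts explicitly, on the assumption that the input may not arrive in time order.

```python
import pandas as pd

# Hypothetical chain: each event is 4 minutes after the previous one.
df = pd.DataFrame({
    'Name': ['Mike'] * 3,
    'Time': pd.to_datetime(['2019-05-10 21:08', '2019-05-10 21:00', '2019-05-10 21:04']),
    'App': ['chrome.exe'] * 3,
})

# Sort so that diff() compares neighbours in time order within each group.
df = df.sort_values(['Name', 'App', 'Time'])

# Gap to the previous run of the same user and app (NaT for the first run).
gap = df.groupby(['Name', 'App'])['Time'].diff()
is_repeat = gap <= pd.Timedelta(minutes=5)

deduped = df[~is_repeat]
print(deduped)
```

Only the 21:00 row survives, although 21:08 is more than 5 minutes after it. Whether that chaining is desirable depends on how a "run" is defined.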
prosti answered Nov 08 '22 11:11