Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas timeseries data - counting unique values during past 24 hours every 30 minutes

I am trying to measure message activity on a forum given the data for each individual message.

To do this, I want to know how many unique/different authors have posted messages in the last 24 hours, at every half-hour (9:00, 9:30, 10:00, ...).

I have a pandas DataFrame to record message information. A message has a post number, the time it was posted, and who wrote it. This is some of my data:

>>> # import pandas as pd
>>> # here df is a pd.DataFrame
>>> print df.loc[:, ['Message Timestamp','Message Author']]

 Post#            Message Timestamp     Message Author
239257    2017-06-09 14:45:46-04:00   JTTLJTTLFBVTNJDF
239258    2017-06-09 14:09:51-04:00        Tvpfrnpvb22
239259    2017-06-09 13:54:13-04:00          Hpzb Tbxb
239260    2017-06-09 13:45:37-04:00      TbnFrbnTrbdfr
239261    2017-06-09 13:28:55-04:00   JTTLJTTLFBVTNJDF
239262    2017-06-09 13:20:23-04:00          njlftlj84
239263    2017-06-09 13:19:59-04:00      TbnFrbnTrbdfr
239264    2017-06-09 13:19:23-04:00   Vjtb Npvb Ttpdlt
239265    2017-06-09 13:15:03-04:00          njlftlj84
239266    2017-06-09 13:06:07-04:00      vndpnnpndfntt
239267    2017-06-09 12:48:54-04:00      TbnFrbnTrbdfr
239268    2017-06-09 12:16:59-04:00       Hrffn n Hpld
239269    2017-06-09 12:06:12-04:00             Xbllfr
239270    2017-06-09 11:27:33-04:00  TbttppfdTrbdfrFrz
239271    2017-06-09 11:21:46-04:00         ND`jn`BjhD
239272    2017-06-09 11:19:34-04:00      TbnFrbnTrbdfr
239273    2017-06-09 10:55:01-04:00      bbndpntfbdfll
239274    2017-06-09 10:55:01-04:00   JTTLJTTLFBVTNJDF
................(continued for years).................

For instance, using the above data, we see that user "JTTLJTTLFBVTNJDF" has posted at least three times in the last twenty four hours; that person only contributes 1 to the "number of unique authors in the last twenty four hours".

The desired output would look like the following (depending on the prior 24 hours of data):

>>> print some_function(df, past='24 hours', every='30 mins')
 Index                        Number_of_unique_authors_in_the_last_24_hours
 2017-06-09 15:00:00-04:00                                               12
 2017-06-09 14:30:00-04:00                                               11
 2017-06-09 14:00:00-04:00                                               13
 ...(and so forth)......

For clarity, this is saying at 15:00, there were 12 different people who had posted messages in the last twenty-four hours, based on the data.

I have tried different combinations of pd.Timestamp.ceil, groupby, and rolling, but nothing close to what I want. An experienced person might know the right combination of things to do.

Also, let me know if someone has a better title idea.


edit: I'm a bit surprised something like below doesn't work.

 series = df.set_index('Message Timestamp')['Message Author']
 series.resample('30 min').rolling('1D').nunique() #not supported

I know how to get a for-loop to do what I want, but it would be nice to find a pandas-powered way of doing things.

like image 730
OrangeSherbet Avatar asked Jan 24 '26 14:01

OrangeSherbet


2 Answers

Consider resample to aggregate to 30 minute intervals and then run nunique. And then run a transform to conditionally aggregate the new unique count column for every 24 hours. You would need to first set the timestamp as index for resample and then back as regular column to aggregate to 24 hours.

import datetime
import pandas as pd
...
df = df[['Message Timestamp', 'Message Author']]

df['24-HourCount'] = df.transform(lambda x:\
          len(df[(df['Message Timestamp'].between(x['Message Timestamp'] - datetime.timedelta(days=1),
                                                  x['Message Timestamp']))]['Message Author'].unique()), axis=1)                                                       
df = df.set_index('Message Timestamp')                                               
df = df[['24-HourCount']].resample('30T').max()

print(df)  
#                      24-HourCount
# Message Timestamp                
# 2017-06-09 14:30:00           2.0
# 2017-06-09 15:00:00           5.0
# 2017-06-09 15:30:00           NaN
# 2017-06-09 16:00:00           7.0
# 2017-06-09 16:30:00           7.0
# 2017-06-09 17:00:00          10.0
# 2017-06-09 17:30:00          11.0
# 2017-06-09 18:00:00          12.0
# 2017-06-09 18:30:00          12.0
like image 135
Parfait Avatar answered Jan 26 '26 06:01

Parfait


I wasn't able to come up with some elegant solution, but a brutal force one by looping through the dataframe, hopefully that will work if your dataset is not huge:

time, unique_count = [], []

for i in range(len(df)):

    time.append(t)
    t = df.ix[i, 'Time']

    #get the datetime of 24 hours ago
    yesterday = t - timedelta(days=1)

    #filter the original dataframe and count unique authors
    count = len(df.ix[(df['Time']<=t) &
            (df['Time']>=yesterday),'Author'].unique())
    unique_count.append(count)


result = pd.DataFrame({'Time': time, 
         'Number_of_unique_authors':unique_count})

Waiting for someone to come up with a more elegant solution.

like image 26
Bubble Bubble Bubble Gut Avatar answered Jan 26 '26 08:01

Bubble Bubble Bubble Gut



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!