Pandas timeseries data - counting unique values during past 24 hours every 30 minutes

Question

I am trying to measure message activity on a forum given the data for each individual message.

To do this, I want to know how many unique/different authors have posted messages in the last 24 hours, at every half-hour (9:00, 9:30, 10:00, ...).

I have a pandas DataFrame to record message information. A message has a post number, the time it was posted, and who wrote it. This is some of my data:

>>> # import pandas as pd
>>> # here df is a pd.DataFrame
>>> print df.loc[:, ['Message Timestamp','Message Author']]

 Post#            Message Timestamp     Message Author
239257    2017-06-09 14:45:46-04:00   JTTLJTTLFBVTNJDF
239258    2017-06-09 14:09:51-04:00        Tvpfrnpvb22
239259    2017-06-09 13:54:13-04:00          Hpzb Tbxb
239260    2017-06-09 13:45:37-04:00      TbnFrbnTrbdfr
239261    2017-06-09 13:28:55-04:00   JTTLJTTLFBVTNJDF
239262    2017-06-09 13:20:23-04:00          njlftlj84
239263    2017-06-09 13:19:59-04:00      TbnFrbnTrbdfr
239264    2017-06-09 13:19:23-04:00   Vjtb Npvb Ttpdlt
239265    2017-06-09 13:15:03-04:00          njlftlj84
239266    2017-06-09 13:06:07-04:00      vndpnnpndfntt
239267    2017-06-09 12:48:54-04:00      TbnFrbnTrbdfr
239268    2017-06-09 12:16:59-04:00       Hrffn n Hpld
239269    2017-06-09 12:06:12-04:00             Xbllfr
239270    2017-06-09 11:27:33-04:00  TbttppfdTrbdfrFrz
239271    2017-06-09 11:21:46-04:00         ND`jn`BjhD
239272    2017-06-09 11:19:34-04:00      TbnFrbnTrbdfr
239273    2017-06-09 10:55:01-04:00      bbndpntfbdfll
239274    2017-06-09 10:55:01-04:00   JTTLJTTLFBVTNJDF
................(continued for years).................

For instance, using the above data, we see that user "JTTLJTTLFBVTNJDF" has posted at least three times in the last twenty four hours; that person only contributes 1 to the "number of unique authors in the last twenty four hours".

The desired output would look like the following (depending on the prior 24 hours of data):

>>> print some_function(df, past='24 hours', every='30 mins')
 Index                        Number_of_unique_authors_in_the_last_24_hours
 2017-06-09 15:00:00-04:00                                               12
 2017-06-09 14:30:00-04:00                                               11
 2017-06-09 14:00:00-04:00                                               13
 ...(and so forth)......

For clarity, this is saying at 15:00, there were 12 different people who had posted messages in the last twenty-four hours, based on the data.

I have tried different combinations of pd.Timestamp.ceil, groupby, and rolling, but nothing close to what I want. An experienced person might know the right combination of things to do.

Also, let me know if someone has a better title idea.

edit: I'm a bit surprised something like below doesn't work.

 series = df.set_index('Message Timestamp')['Message Author']
 series.resample('30 min').rolling('1D').nunique() #not supported

I know how to get a for-loop to do what I want, but it would be nice to find a pandas-powered way of doing things.

Parfait · Accepted Answer

Consider resample to aggregate to 30 minute intervals and then run nunique. And then run a transform to conditionally aggregate the new unique count column for every 24 hours. You would need to first set the timestamp as index for resample and then back as regular column to aggregate to 24 hours.

import datetime
import pandas as pd
...
df = df[['Message Timestamp', 'Message Author']]

df['24-HourCount'] = df.transform(lambda x:\
          len(df[(df['Message Timestamp'].between(x['Message Timestamp'] - datetime.timedelta(days=1),
                                                  x['Message Timestamp']))]['Message Author'].unique()), axis=1)                                                       
df = df.set_index('Message Timestamp')                                               
df = df[['24-HourCount']].resample('30T').max()

print(df)  
#                      24-HourCount
# Message Timestamp                
# 2017-06-09 14:30:00           2.0
# 2017-06-09 15:00:00           5.0
# 2017-06-09 15:30:00           NaN
# 2017-06-09 16:00:00           7.0
# 2017-06-09 16:30:00           7.0
# 2017-06-09 17:00:00          10.0
# 2017-06-09 17:30:00          11.0
# 2017-06-09 18:00:00          12.0
# 2017-06-09 18:30:00          12.0

Bubble Bubble Bubble Gut · Answer

I wasn't able to come up with some elegant solution, but a brutal force one by looping through the dataframe, hopefully that will work if your dataset is not huge:

time, unique_count = [], []

for i in range(len(df)):

    time.append(t)
    t = df.ix[i, 'Time']

    #get the datetime of 24 hours ago
    yesterday = t - timedelta(days=1)

    #filter the original dataframe and count unique authors
    count = len(df.ix[(df['Time']<=t) &
            (df['Time']>=yesterday),'Author'].unique())
    unique_count.append(count)


result = pd.DataFrame({'Time': time, 
         'Number_of_unique_authors':unique_count})

Waiting for someone to come up with a more elegant solution.

Pandas timeseries data - counting unique values during past 24 hours every 30 minutes

Tags:

python

pandas

python-2.7

pandas-groupby

time-series

OrangeSherbet

2 Answers

Parfait

Bubble Bubble Bubble Gut

Recent Activity

Donate For Us

Pandas timeseries data - counting unique values during past 24 hours every 30 minutes

Tags:

python

pandas

python-2.7

pandas-groupby

time-series

OrangeSherbet

2 Answers

Parfait

Bubble Bubble Bubble Gut

Related questions

Recent Activity

Donate For Us