I would like to look at TimeSeries data for every client over various time periods in Pandas.
import pandas as pd
import numpy as np
import random
clients = np.random.randint(1, 11, size=100)
dates = pd.date_range('20130101',periods=365)
OrderDates = random.sample(list(dates),100)
Values = np.random.randint(10, 250, size=100)
df = pd.DataFrame({ 'Client' : clients,'OrderDate' : OrderDates, 'Value' : Values})
df = df.sort_values(['OrderDate', 'Client'], ascending=[True, True])
df.head()
# Client OrderDate Value
# 36 3 2013-01-11 40
# 55 4 2013-01-12 192
# 54 8 2013-01-15 130
# 48 10 2013-01-17 153
# 78 9 2013-01-22 171
What I am trying to accomplish is to get the count and the sum of the 'Value' column, grouped by 'Client' for various time periods (Monthly, Quarterly, Yearly - I will likely build 3 different dataframes for this data, then make the dataframes 'wide').
For Quarterly, I would expect something like this:
Client OrderDate NumberofEntries SumofValues
1 2013-03-31 7 28
1 2013-06-30 2 7
1 2013-09-30 6 20
1 2013-12-31 1 3
2 2013-03-31 1 4
2 2013-06-30 2 8
2 2013-09-30 3 17
2 2013-12-31 4 24
I could add a column to that data frame holding the quarter (or month, or year) of each entry and then use Pandas' groupby function, but that seems like extra work when I should be using TimeSeries.
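For reference, the add-a-column approach described above might look roughly like the sketch below. The seeded data is only a reproducible stand-in for the random sample, and the `Quarter` column name is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the sample data (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
dates = pd.date_range("2013-01-01", periods=365)
df = pd.DataFrame({
    "Client": rng.integers(1, 11, size=100),
    "OrderDate": dates[rng.choice(365, size=100, replace=False)],
    "Value": rng.integers(10, 250, size=100),
})

# Label each order with its calendar quarter ("Quarter" is a made-up column name),
# then group on client + quarter and aggregate the Value column.
df["Quarter"] = df["OrderDate"].dt.to_period("Q")
by_quarter = (
    df.groupby(["Client", "Quarter"])["Value"]
      .agg(NumberofEntries="count", SumofValues="sum")
      .reset_index()
)
print(by_quarter.head())
```

Only client/quarter pairs that actually contain orders show up in this result; quarters with no orders are simply absent.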
I've read the documentation and reviewed a TimeSeries demonstration by Wes, but I don't see a way to do a groupby for the Client and then perform the TimeSeries operations over the time periods I am trying to build. (Alternatively, I could run a for loop and build the dataframe that way, but again, that seems like more work than there should be.)
Is there a way to combine a groupby process with TimeSeries?
A slight alternative is to set_index before doing the groupby:
In [11]: df.set_index('OrderDate', inplace=True)
In [12]: g = df.groupby('Client')
In [13]: g['Value'].resample('Q').agg([np.sum, len])
Out[13]:
sum len
Client OrderDate
1 2013-03-31 239 1
2013-06-30 83 1
2013-09-30 249 2
2013-12-31 506 3
2 2013-03-31 581 4
2013-06-30 569 4
2013-09-30 316 4
2013-12-31 465 5
...
Note: you don't need to do the sort before doing this.
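As a self-contained sketch of this set_index, groupby, resample chain: recent pandas versions no longer accept a `how=` argument to `resample`, so the aggregations are chained with `.agg` instead. The seeded data and the use of an offset object (rather than a string alias, which some newer releases spell `'QE'`) are assumptions made here so the example runs the same way across versions.

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the sample data (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
dates = pd.date_range("2013-01-01", periods=365)
df = pd.DataFrame({
    "Client": rng.integers(1, 11, size=100),
    "OrderDate": dates[rng.choice(365, size=100, replace=False)],
    "Value": rng.integers(10, 250, size=100),
})

# Calendar quarter ends (Mar 31, Jun 30, Sep 30, Dec 31) as an offset object.
quarter_end = pd.offsets.QuarterEnd(startingMonth=12)

# Index by date, group by client, then resample each client's orders by quarter.
quarterly = (
    df.set_index("OrderDate")
      .groupby("Client")["Value"]
      .resample(quarter_end)
      .agg(["sum", "count"])
)
print(quarterly.head(8))
```

The result is indexed by (Client, OrderDate), with one row per quarter end between each client's first and last order.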
Something like this? I'm first doing a groupby, and then applying a resample on each group.
In [11]: grouped = df.groupby('Client')
In [12]: result = grouped.apply(lambda x: x.set_index('OrderDate').resample('Q').agg([np.sum, len]))
In [13]: result['Value']
Out[13]:
sum len
Client OrderDate
1 2013-03-31 227 4
2013-06-30 344 2
2013-09-30 234 1
2 2013-03-31 299 2
2013-06-30 538 4
2013-09-30 236 2
2013-12-31 1124 7
3 2013-03-31 496 4
2013-06-30 NaN 0
2013-09-30 167 2
2013-12-31 218 1
Update: with the suggestion of @AndyHayden in his answer, this becomes much cleaner code:
df = df.set_index('OrderDate')
grouped = df.groupby('Client')
grouped['Value'].resample('Q').agg([np.sum, len])
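Since the question mentions building monthly, quarterly and yearly summaries and then making them wide, here is one possible sketch using `pd.Grouper` inside a single `groupby`. The seeded data, the `periods` and `summaries` names, and the choice to fill missing quarters with 0 are all assumptions for illustration, not part of the answers above.

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the sample data (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
dates = pd.date_range("2013-01-01", periods=365)
df = pd.DataFrame({
    "Client": rng.integers(1, 11, size=100),
    "OrderDate": dates[rng.choice(365, size=100, replace=False)],
    "Value": rng.integers(10, 250, size=100),
})

# One offset per reporting period; offset objects avoid string-alias differences
# between pandas versions.
periods = {
    "monthly": pd.offsets.MonthEnd(),
    "quarterly": pd.offsets.QuarterEnd(startingMonth=12),
    "yearly": pd.offsets.YearEnd(),
}

# Group by client and by time bucket in one step, without setting the index.
summaries = {}
for name, offset in periods.items():
    summaries[name] = (
        df.groupby(["Client", pd.Grouper(key="OrderDate", freq=offset)])["Value"]
          .agg(["count", "sum"])
    )

# "Wide" layout: one row per client, one column per (statistic, quarter end).
wide = summaries["quarterly"].unstack("OrderDate", fill_value=0)
print(wide)
```

Grouping this way only emits client/period pairs that contain orders, which is why the wide table fills the gaps explicitly.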