Given the following dataset as a pandas dataframe df:
index (DatetimeIndex) | Name      | Amount | IncomeOutcome
-----------------------------------------------------------
2019-01-28            | Customer1 |  200.0 | Income
2019-01-31            | Customer1 |  200.0 | Income
2019-01-31            | Customer2 |  100.0 | Income
2019-01-28            | Customer2 | -100.0 | Outcome
2019-01-31            | Customer2 | -100.0 | Outcome
We perform the following steps:
grouped = df.groupby(["Name", "IncomeOutcome"])
sampled_by_month = grouped.resample("M")
aggregated = sampled_by_month.agg({"MonthlyCount": "size", "Amount": "sum"})
The desired output should look like this:
Name      | IncomeOutcome | Amount | MonthlyCount
--------------------------------------------------
Customer1 | Income        |  400.0 | 2
Customer2 | Income        |  100.0 | 1
Customer2 | Outcome       | -200.0 | 2
The last step performs very poorly, possibly related to pandas issue #20660. My first intention was to convert all datetime objects to int64, which leaves me with the question of how to resample the converted data by month.
Any suggestions on that issue?
Thank you in advance
Perhaps we can optimise your solution by resampling only the single column of interest ("Amount").
(df.groupby(["Name", "IncomeOutcome"])['Amount']
.resample("M")
.agg(['sum','size'])
.rename({'sum':'Amount', 'size': 'MonthlyCount'}, axis=1)
.reset_index(level=-1, drop=True)
.reset_index())
        Name IncomeOutcome  Amount  MonthlyCount
0  Customer1        Income   400.0             2
1  Customer2        Income   100.0             1
2  Customer2       Outcome  -200.0             2
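For clarity, the reset_index(level=-1, drop=True) step discards the innermost index level, i.e. the month-end timestamps that resample("M") appends below Name and IncomeOutcome. Dropping it is harmless here because the sample spans a single month. A quick way to inspect that level:

intermediate = (df.groupby(["Name", "IncomeOutcome"])['Amount']
                  .resample("M")
                  .agg(['sum', 'size']))
# innermost level: one month-end timestamp per (Name, IncomeOutcome) group
print(intermediate.index.get_level_values(-1))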
If this is still too slow, then I think the problem could be that the resample call sits inside the groupby, which slows things down. Perhaps you can try grouping by all three predicates with a single groupby call, handling the date resampling with pd.Grouper.
(df.groupby(['Name', 'IncomeOutcome', pd.Grouper(freq='M')])['Amount']
   .agg([('Amount', 'sum'), ('MonthlyCount', 'size')])
   .reset_index(level=-1, drop=True)
   .reset_index())
        Name IncomeOutcome  Amount  MonthlyCount
0  Customer1        Income   400.0             2
1  Customer2        Income   100.0             1
2  Customer2       Outcome  -200.0             2
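As an aside, pd.Grouper(freq='M') binds to the DataFrame's DatetimeIndex by default; if your dates lived in an ordinary column instead, you could point it there with pd.Grouper(key='your_date_column', freq='M'). And since you mentioned converting the datetimes yourself: rather than going through int64, an equivalent conversion is to group on monthly periods directly. A sketch of that variant, producing the same result:

# same grouping expressed via a PeriodIndex instead of resample/Grouper
(df.groupby(['Name', 'IncomeOutcome', df.index.to_period('M')])['Amount']
   .agg([('Amount', 'sum'), ('MonthlyCount', 'size')])
   .reset_index(level=-1, drop=True)
   .reset_index())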
Performance-wise, this should come out faster.
Performance
Let's try setting up a more general DataFrame for the purpose of testing.
# Setup
df_ = df.copy()
df1 = pd.concat([df_.reset_index()] * 100, ignore_index=True)
df = pd.concat([
df1.replace({'Customer1': f'Customer{i}', 'Customer2': f'Customer{i+1}'})
for i in range(1, 98, 2)], ignore_index=True)
df = df.set_index('index')
df.shape
# (24500, 3)
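Before timing, a quick sanity check that both pipelines agree on the enlarged frame (a sketch using pandas' own testing helper; both results should already come back in the same sorted group order, so no extra sorting should be needed):

from pandas.testing import assert_frame_equal

via_resample = (df.groupby(["Name", "IncomeOutcome"])['Amount']
                  .resample("M")
                  .agg(['sum', 'size'])
                  .rename({'sum': 'Amount', 'size': 'MonthlyCount'}, axis=1)
                  .reset_index(level=-1, drop=True)
                  .reset_index())
via_grouper = (df.groupby(['Name', 'IncomeOutcome', pd.Grouper(freq='M')])['Amount']
                 .agg([('Amount', 'sum'), ('MonthlyCount', 'size')])
                 .reset_index(level=-1, drop=True)
                 .reset_index())
assert_frame_equal(via_resample, via_grouper)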
%%timeit
(df.groupby(["Name", "IncomeOutcome"])['Amount']
   .resample("M")
   .agg(['sum', 'size'])
   .rename({'sum': 'Amount', 'size': 'MonthlyCount'}, axis=1)
   .reset_index(level=-1, drop=True)
   .reset_index())
# 1.71 s ± 85.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
(df.groupby(['Name', 'IncomeOutcome', pd.Grouper(freq='M')])['Amount']
   .agg([('Amount', 'sum'), ('MonthlyCount', 'size')])
   .reset_index(level=-1, drop=True)
   .reset_index())
# 24.2 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)