I have a pandas DataFrame with the following columns:
date | months | price
I am calculating some basic BI metrics. I computed the Net Revenue by grouping the dataframe on date and summing the price:
df = df[["date", "price"]].groupby(df["date"])["price"].sum().reset_index()
Now I want to find the MRR, which is similar to the Net Revenue, except that when the months column is greater than 1, the price should be spread equally over the following months. It is also grouped by month rather than by day.
For example, if a row dated January 2016 has 3 months and a price of 30$, I should add 10$ to January, 10$ to February and 10$ to March.
My first idea was to iterate through the dataframe, keep track of the months and the amount of price I should "move" to the next months, and build a new dataframe manually.
But, first, is there any Pythonic way in Pandas to do it?
Data to reproduce a dataframe:
import pandas as pd
df = pd.DataFrame({'date': ['01-01-2016', '05-01-2016', '10-01-2016','04-02-2016'],
'months': [1, 3, 1, 6],
'price': [40, 60, 20, 60]})
Desired result:
Date | MRR
January 2016 | 80
February 2016 | 30
March 2016 | 30
April 2016 | 10
May 2016 | 10
June 2016 | 10
July 2016 | 10
The results are calculated like this, with one term per row of the dataframe:
January 2016 = 40 + 20 + 20 + 0
February 2016 = 0 + 20 + 0 + 10
March 2016 = 0 + 20 + 0 + 10
April 2016 = 0 + 0 + 0 + 10
May 2016 = 0 + 0 + 0 + 10
June 2016 = 0 + 0 + 0 + 10
July 2016 = 0 + 0 + 0 + 10
I don't know any way around using a loop. However, I can suggest a way to make the code pretty clean and efficient.
First, let's load the example data you supplied in the question text:
df = pd.DataFrame({'date': ['01-01-2016', '05-01-2016', '10-01-2016','04-02-2016'],
'months': [1, 3, 1, 6],
'price': [40, 60, 20, 60]})
In order to use pandas' date functionality (e.g. grouping by month), we will use the date column as the index, a DatetimeIndex in fact:
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')
df = df.set_index('date')
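The explicit format string matters because the dates in the question are day-first. As a small sanity check (a sketch only; the variable names are made up for illustration), the same string parses to a different date when pandas is left to guess:

```python
import pandas as pd

raw = '05-01-2016'  # second row of the example data

# With the explicit day-month-year format, this is 5 January 2016.
day_first = pd.to_datetime(raw, format='%d-%m-%Y')

# Without a format, pandas reads ambiguous strings month-first by default,
# so the same text becomes 1 May 2016.
guessed = pd.to_datetime(raw)

print(day_first.date(), guessed.date())
```

Getting this wrong would silently move the row into the wrong month, which is exactly what the monthly grouping depends on.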
Now, it's really easy to, for example, view a month-by-month summary, by using the resample function that works like the groupby function you already know, but uses time-periods:
df.resample('M').sum()
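As far as I know, newer pandas releases (2.2 and later) warn about the 'M' alias in resample and prefer 'ME' for month-end. If that is a concern, one alternative sketch is to group on monthly periods instead, which does not depend on the resample alias. This assumes date is still a regular column (before set_index), and the name monthly is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['01-01-2016', '05-01-2016', '10-01-2016', '04-02-2016'],
    'months': [1, 3, 1, 6],
    'price': [40, 60, 20, 60],
})
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')

# Convert each date to its calendar month (a Period) and sum per month.
monthly = df.groupby(df['date'].dt.to_period('M'))['price'].sum()
print(monthly)
```

This gives the plain month-by-month revenue (no spreading yet): three rows fall in January and one in February.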
Now to "spread out" rows where the months column is > 1 over multiple months. My approach here is to generate a new DataFrame for each row:
dfs = []
for date, values in df.iterrows():
    months, price = values
    dfs.append(
        pd.DataFrame(
            # Compute the price for each month, and repeat this value
            data={'price': [price / months] * months},
            # The index is a date range for the requested number of months
            index=pd.date_range(date, periods=months, freq='M')
        )
    )
Now we can just concatenate the list of DataFrames, resample to months and take the sum:
output = pd.concat(dfs).resample('M').sum()
print(output)
Output:
price
2016-01-31 80
2016-02-29 30
2016-03-31 30
2016-04-30 10
2016-05-31 10
2016-06-30 10
2016-07-31 10
See http://pandas.pydata.org/pandas-docs/stable/timeseries.html for all the cool things pandas can do with time series. For example, to produce exactly your desired output you could do this:
output.index = output.index.strftime('%B %Y')
Which results in this:
price
January 2016 80
February 2016 30
March 2016 30
April 2016 10
May 2016 10
June 2016 10
July 2016 10
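For what it's worth, a loop-free variant also seems possible. The following is only a sketch under the same example data, not a tested replacement for the approach above: it repeats each row once per covered month, shifts the month forward by the repeat counter, and sums equal shares. The names expanded, offset, share and mrr are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['01-01-2016', '05-01-2016', '10-01-2016', '04-02-2016'],
    'months': [1, 3, 1, 6],
    'price': [40, 60, 20, 60],
})
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')

# Repeat each row once for every month it covers.
expanded = df.loc[df.index.repeat(df['months'])].copy()

# Counter 0, 1, 2, ... within each original row: months to shift forward.
offset = expanded.groupby(level=0).cumcount().to_numpy()

# Starting month as a monthly Period, moved forward by the counter.
expanded['month'] = expanded['date'].dt.to_period('M') + offset

# Every repeated row carries an equal share of the original price.
expanded['share'] = expanded['price'] / expanded['months']

mrr = expanded.groupby('month')['share'].sum()
print(mrr)
```

The result is indexed by monthly periods rather than month-end timestamps, so the strftime('%B %Y') trick applies to mrr.index in the same way.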