I have this type of data, but in real life it has millions of entries. A product id is always specific to one product, but the same id occurs several times over the product's lifetime.
date | product id | revenue | estimated lifetime value |
---|---|---|---|
2021-04-16 | 0061M00001AXc5lQAD | 970 | 2000 |
2021-04-17 | 0061M00001AXbCiQAL | 159 | 50000 |
2021-04-18 | 0061M00001AXb9AQAT | 80 | 3000 |
2021-04-19 | 0061M00001AXbIHQA1 | 1100 | 8000 |
2021-04-20 | 0061M00001AXbY8QAL | 90 | 4000 |
2021-04-21 | 0061M00001AXbQ1QAL | 29 | 30000 |
2021-04-21 | 0061M00001AXc5lQAD | 30 | 2000 |
2021-05-02 | 0061M00001AXc5lQAD | 50 | 2000 |
2021-05-05 | 0061M00001AXc5lQAD | 50 | 2000 |
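For reference, a minimal snippet that reproduces the sample above (assuming the frame is named df, as in the answer below):

import pandas as pd

df = pd.DataFrame({
    'date': ['2021-04-16', '2021-04-17', '2021-04-18', '2021-04-19', '2021-04-20',
             '2021-04-21', '2021-04-21', '2021-05-02', '2021-05-05'],
    'product id': ['0061M00001AXc5lQAD', '0061M00001AXbCiQAL', '0061M00001AXb9AQAT',
                   '0061M00001AXbIHQA1', '0061M00001AXbY8QAL', '0061M00001AXbQ1QAL',
                   '0061M00001AXc5lQAD', '0061M00001AXc5lQAD', '0061M00001AXc5lQAD'],
    'revenue': [970, 159, 80, 1100, 90, 29, 30, 50, 50],
    'estimated lifetime value': [2000, 50000, 3000, 8000, 4000, 30000, 2000, 2000, 2000],
})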
I'm looking to create a new column in pandas that indicates when a certain product id has generated more revenue than a specific threshold, e.g. $100 or $1000, marking it as a Win (1). A win may occur only once during the lifecycle of a product. In addition, I would like to create another column that indicates the row where a specific product's sales exceed e.g. 10% of the estimated lifetime value.
What would be the most intuitive approach to achieve this in Python / Pandas?
edit:
dw1k_thresh: if the cumulative sales of a specific product id >= 1000, the column takes a value of 1, otherwise 0. However, the 1 can occur only once, and after that the column is always 0 again. Basically it's just an indicator of the date and transaction when a product's sales exceed the critical value of 1000.
dw10perc: if the cumulative sales of one product id >= 10% of its estimated lifetime value, the column takes a value of 1, otherwise 0. However, the 1 can occur only once, and after that the column is always 0 again. Basically it's just an indicator of the date and transaction when a product's sales exceed the critical value of 10% of the estimated lifetime value.
The threshold value is common to all product ids (I'll just replicate the process with different thresholds at a later stage to determine which threshold is optimal for predicting future revenue).
I'm trying to achieve this with the code below, which tries to establish the cum_rev and dw1k_thresh columns, but unfortunately it doesn't work.
df_final["dw1k_thresh"] = 0
df_final["cum_rev"]= 0
opp_list =set()
for row in df_final["product id"].iteritems():
opp_list.add(row)
opp_list=list(opp_list)
opp_list=pd.Series(opp_list)
for i in opp_list:
if i == df_final["product id"].any():
df_final.cum_rev = df_final.revenue.cumsum()
for x in df_final.cum_rev:
if x >= 1000 & df_final.dw1k_thresh.sum() == 0:
df_final.dw1k_thresh = 1
else:
df_final.dw1k_thresh = 0
df_final.head(30)
You can use groupby and cumsum. As a first step you would need to remove the $ sign and make sure your columns are of numeric type to perform the comparisons you outlined.
# Imports
import pandas as pd
import numpy as np
# Remove $ sign and convert to numeric
cols = ['revenue','estimated lifetime value']
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True).astype(float)
# Cumulative Revenue
df['cum_rev'] = df.groupby('product id')['revenue'].cumsum()
# Helper, applied for both indicators: returns the index of the first row
# (earliest date) per product where the indicator column thresh_col equals 1
def f(df, thresh_col):
    return (df[df[thresh_col]==1].sort_values(['date','product id'], ascending=False)
            .groupby('product id', as_index=False, group_keys=False)
            .apply(lambda x: x.tail(1))
            ).index.tolist()
# dw1k_thresh: flag rows where cumulative revenue >= 1000, then keep only the first crossing per product
df['dw1k_thresh'] = np.where(df['cum_rev'].ge(1000),1,0)
df['dw1k_thresh'] = np.where(df.index.isin(f(df,'dw1k_thresh')),1,0)
# dw10perc: compare against 10% of the estimated lifetime value, which is
# constant per product, so take the group's 'first' value rather than its sum
df['dw10_perc'] = np.where(df['cum_rev'] > 0.10 * df.groupby('product id',observed=True)['estimated lifetime value'].transform('first'),1,0)
df['dw10_perc'] = np.where(df.index.isin(f(df,'dw10_perc')),1,0)
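As a side note, the "mark only the first crossing" step can also be done without the helper function, using a grouped cumulative count of the flag. A minimal sketch for the 1000 threshold, assuming rows are already sorted by date within each product (as they are here):

m = df['cum_rev'].ge(1000)
# the grouped cumulative count of True flags equals 1 exactly at the first crossing
df['dw1k_thresh'] = (m & m.astype(int).groupby(df['product id']).cumsum().eq(1)).astype(int)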
Either way, this prints:
>>> df
date product id revenue ... cum_rev dw1k_thresh dw10_perc
0 2021-04-16 0061M00001AXc5lQAD 970 ... 970 0 1
1 2021-04-17 0061M00001AXbCiQAL 159 ... 159 0 0
2 2021-04-18 0061M00001AXb9AQAT 80 ... 80 0 0
3 2021-04-19 0061M00001AXbIHQA1 1100 ... 1100 1 1
4 2021-04-20 0061M00001AXbY8QAL 90 ... 90 0 0
5 2021-04-21 0061M00001AXbQ1QAL 29 ... 29 0 0
6 2021-04-21 0061M00001AXc5lQAD 30 ... 1000 1 0
7 2021-05-02 0061M00001AXc5lQAD 50 ... 1050 0 0
8 2021-05-05 0061M00001AXc5lQAD 50 ... 1100 0 0
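Since you mention replicating the process with different thresholds later, the two dw1k_thresh lines generalize naturally. A minimal sketch of such a sweep; the threshold values and column names below are placeholders, not taken from the question:

# Hypothetical threshold sweep: the values and names are illustrative only
thresholds = {'dw100_thresh': 100, 'dw1k_thresh': 1000, 'dw10k_thresh': 10000}
for name, thresh in thresholds.items():
    df[name] = np.where(df['cum_rev'].ge(thresh), 1, 0)
    df[name] = np.where(df.index.isin(f(df, name)), 1, 0)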