Cumulative sales data with threshold value forming a new series / column with a boolean value?

Question

I have this type of data, but in real life it has millions of entries. Product id is always product specific, but occurs several times during its lifetime.

date	product id	revenue	estimated lifetime value
2021-04-16	0061M00001AXc5lQAD	970	2000
2021-04-17	0061M00001AXbCiQAL	159	50000
2021-04-18	0061M00001AXb9AQAT	80	3000
2021-04-19	0061M00001AXbIHQA1	1100	8000
2021-04-20	0061M00001AXbY8QAL	90	4000
2021-04-21	0061M00001AXbQ1QAL	29	30000
2021-04-21	0061M00001AXc5lQAD	30	2000
2021-05-02	0061M00001AXc5lQAD	50	2000
2021-05-05	0061M00001AXc5lQAD	50	2000

I'm looking to create a new column in pandas that indicates when a certain product id has generated more revenue than a specific threshold e.g. 100$, 1000$, marking it as a Win (1). A win may occur only once during the lifecycle of a product. In addition I would want to create another column that would indicate the row where a specific product sales exceeds e.g. 10% of the estimated lifetime value.

What would be the most intuitive approach to achieve this in Python / Pandas?

edit:

dw1k_thresh: if the cumulative sales of a specific product id are >= 1000, the column takes the value 1, otherwise 0. However, 1 can occur only once per product; every row after that is 0 again. Basically it's just an indicator of the date and transaction at which a product's cumulative sales reach the critical value of 1000.
dw10perc: if the cumulative sales of one product id are >= 10% of the estimated lifetime value, the column takes the value 1, otherwise 0. Again, 1 can occur only once per product; every row after that is 0. Basically it's just an indicator of the date and transaction at which a product's cumulative sales reach 10% of the estimated lifetime value.
The threshold value is common for all product id's (I'll just replicate the process with different thresholds at a later stage to determine which is the optimal threshold to predict future revenue).

I'm trying to achieve this: (image of the desired output table, not available)

The code I've written so far is trying to establish the cum_rev and dw1k_thresh columns, but unfortunately it doesn't work.

df_final["dw1k_thresh"] = 0
df_final["cum_rev"]= 0 

opp_list =set()

for row in df_final["product id"].iteritems():
    opp_list.add(row)


opp_list=list(opp_list)
opp_list=pd.Series(opp_list)


for i in opp_list: 
    if i == df_final["product id"].any():
        df_final.cum_rev = df_final.revenue.cumsum()

    for x in df_final.cum_rev:
        if x >= 1000 & df_final.dw1k_thresh.sum() == 0: 
            df_final.dw1k_thresh = 1
        else: 
            df_final.dw1k_thresh = 0

df_final.head(30)

sophocles · Accepted Answer

Cumulative Revenue: Can be calculated fairly simply with groupby and cumsum.
dw1k_thresh: We first check whether cum_rev is at least 1000, then apply a function that keeps the 1 only on the first such row per product, so every later row goes back to 0.
dw10_perc: Same approach as dw1k_thresh, with 10% of the estimated lifetime value as the threshold.

As a first step you would need to remove $ and make sure your columns are of numeric type to perform the comparisons you outlined.

# Imports
import pandas as pd
import numpy as np

# Remove $ signs and thousands separators, then convert to numeric
cols = ['revenue','estimated lifetime value']
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True).astype(float)

# Cumulative revenue per product (rows must already be in date order within each product)
df['cum_rev'] = df.groupby('product id')['revenue'].cumsum()

# Function to be applied on both: index labels of the earliest flagged row per product
def f(df,thresh_col):
    flagged = df[df[thresh_col]==1]
    # Sorted newest-first, so the last row of each group is the earliest date
    return (flagged.sort_values(['date','product id'], ascending=False)
            .groupby('product id')
            .tail(1)
            .index.tolist())

# dw1k_thresh
df['dw1k_thresh'] = np.where(df['cum_rev'].ge(1000),1,0)
df['dw1k_thresh'] = np.where(df.index.isin(f(df,'dw1k_thresh')),1,0)

# dw10_perc
# The lifetime value is repeated on every row of a product, so compare row-wise instead of summing it per product
df['dw10_perc'] = np.where(df['cum_rev'].ge(0.10 * df['estimated lifetime value']),1,0)
df['dw10_perc'] = np.where(df.index.isin(f(df,'dw10_perc')),1,0)

Prints:

>>> df

        date          product id  revenue  ...  cum_rev  dw1k_thresh  dw10_perc
0 2021-04-16  0061M00001AXc5lQAD      970  ...      970            0          1
1 2021-04-17  0061M00001AXbCiQAL      159  ...      159            0          0
2 2021-04-18  0061M00001AXb9AQAT       80  ...       80            0          0
3 2021-04-19  0061M00001AXbIHQA1     1100  ...     1100            1          1
4 2021-04-20  0061M00001AXbY8QAL       90  ...       90            0          0
5 2021-04-21  0061M00001AXbQ1QAL       29  ...       29            0          0
6 2021-04-21  0061M00001AXc5lQAD       30  ...     1000            1          0
7 2021-05-02  0061M00001AXc5lQAD       50  ...     1050            0          0
8 2021-05-05  0061M00001AXc5lQAD       50  ...     1100            0          0
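
For comparison, here is a minimal sketch of the same idea without the helper f. It assumes revenue is never negative, thresholds are positive, and rows are in date order within each product, so the cumulative sum can only cross a threshold once. The function name first_crossing, its parameters, and the win_<threshold> column names are made up for this illustration; they are not part of the code above.

```python
import pandas as pd

def first_crossing(df, threshold, id_col="product id", rev_col="revenue"):
    """Return a 0/1 Series marking, per product, the first row where
    cumulative revenue reaches `threshold` (a scalar or a row-aligned Series).

    Assumes non-negative revenue and rows sorted by date within each product.
    """
    cum = df.groupby(id_col)[rev_col].cumsum()
    prev = cum - df[rev_col]  # cumulative revenue just before this row
    return ((cum >= threshold) & (prev < threshold)).astype(int)

# A subset of the sample rows from the question
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-04-16", "2021-04-19", "2021-04-21",
                            "2021-05-02", "2021-05-05"]),
    "product id": ["0061M00001AXc5lQAD", "0061M00001AXbIHQA1",
                   "0061M00001AXc5lQAD", "0061M00001AXc5lQAD",
                   "0061M00001AXc5lQAD"],
    "revenue": [970, 1100, 30, 50, 50],
    "estimated lifetime value": [2000, 8000, 2000, 2000, 2000],
})
df = df.sort_values("date", kind="stable")

df["dw1k_thresh"] = first_crossing(df, 1000)
df["dw10_perc"] = first_crossing(df, 0.10 * df["estimated lifetime value"])

# Trying several fixed thresholds is then a short loop
for t in (100, 1000):
    df[f"win_{t}"] = first_crossing(df, t)
```

Because the flag is set only where the running total reaches the threshold and the total before that row was still below it, each product gets at most one 1. If refunds (negative revenue) can occur, the total could cross the threshold more than once and this shortcut would no longer hold.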

Alex Günsberg
