
Cumulative sales data with threshold value forming a new series / column with a boolean value?


I have this type of data, but in real life it has millions of entries. Each product id identifies a single product, but the same id appears on several rows over the product's lifetime.

date        product id          revenue  estimated lifetime value
2021-04-16  0061M00001AXc5lQAD      970                      2000
2021-04-17  0061M00001AXbCiQAL      159                     50000
2021-04-18  0061M00001AXb9AQAT       80                      3000
2021-04-19  0061M00001AXbIHQA1     1100                      8000
2021-04-20  0061M00001AXbY8QAL       90                      4000
2021-04-21  0061M00001AXbQ1QAL       29                     30000
2021-04-21  0061M00001AXc5lQAD       30                      2000
2021-05-02  0061M00001AXc5lQAD       50                      2000
2021-05-05  0061M00001AXc5lQAD       50                      2000

I'm looking to create a new column in pandas that indicates when the cumulative revenue of a given product id has passed a specific threshold (e.g. $100 or $1,000), marking that row as a Win (1). A win may occur only once during the lifecycle of a product. In addition, I want another column that marks the row where a product's cumulative sales exceed e.g. 10% of its estimated lifetime value.

What would be the most intuitive approach to achieve this in Python / Pandas?

edit:

  • dw1k_thresh: if the cumulative sales of a specific product id are >= 1000, the column takes the value 1, otherwise 0. However, 1 can occur only once per product; after that the value is always 0 again. Basically it's just an indicator of the date and transaction at which a product's sales reach the critical value of 1000.

  • dw10perc: if the cumulative sales of one product id are >= 10% of its estimated lifetime value, the column takes the value 1, otherwise 0. However, 1 can occur only once per product; after that the value is always 0 again. Basically it's just an indicator of the date and transaction at which a product's sales reach the critical value of 10% of the estimated lifetime value.

  • The threshold value is common for all product id's (I'll just replicate the process with different thresholds at a later stage to determine which is the optimal threshold to predict future revenue).

I'm trying to achieve this: [image of the desired output]

The code I've written so far is trying to establish the cum_rev and dw1k_thresh columns, but unfortunately it doesn't work.

df_final["dw1k_thresh"] = 0
df_final["cum_rev"]= 0 

opp_list =set()

for row in df_final["product id"].iteritems():
    opp_list.add(row)


opp_list=list(opp_list)
opp_list=pd.Series(opp_list)


for i in opp_list: 
    if i == df_final["product id"].any():
        df_final.cum_rev = df_final.revenue.cumsum()

    for x in df_final.cum_rev:
        if x >= 1000 & df_final.dw1k_thresh.sum() == 0: 
            df_final.dw1k_thresh = 1
        else: 
            df_final.dw1k_thresh = 0

df_final.head(30)
Alex Günsberg asked Apr 19 '21 09:04



1 Answer

  1. Cumulative revenue: this can be calculated with groupby and cumsum.
  2. dw1k_thresh: first flag every row where cum_rev is at least 1000, then apply a helper function that keeps the 1 only on the earliest flagged row of each product id and sets every other row to zero.
  3. dw10_perc: same approach as dw1k_thresh, with the threshold set to 10% of the estimated lifetime value.

As a first step you would need to remove $ and make sure your columns are of numeric type to perform the comparisons you outlined.

# Imports
import pandas as pd
import numpy as np

# Remove $ sign and convert to numeric
cols = ['revenue','estimated lifetime value']
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True).astype(float)

# Cumulative Revenue
df['cum_rev'] = df.groupby('product id')['revenue'].cumsum()

# Helper applied to both flag columns: returns the index labels of the
# earliest flagged row (thresh_col == 1) for each product id
def f(df,thresh_col):
    return  (df[df[thresh_col]==1].sort_values(['date','product id'], ascending=False)
            .groupby('product id', as_index=False,group_keys=False)
            .apply(lambda x: x.tail(1))
            ).index.tolist()

# dw1k_thresh
df['dw1k_thresh'] = np.where(df['cum_rev'].ge(1000),1,0)
df['dw1k_thresh'] = np.where(df.index.isin(f(df,'dw1k_thresh')),1,0)

# dw10perc
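# Compare against 10% of the row's own lifetime value; summing that column per
# product would multiply the threshold by the number of rows for the product
df['dw10_perc'] = np.where(df['cum_rev'].ge(0.10 * df['estimated lifetime value']),1,0)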
df['dw10_perc'] = np.where(df.index.isin(f(df,'dw10_perc')),1,0)

Prints:

>>> df

        date          product id  revenue  ...  cum_rev  dw1k_thresh  dw10_perc
0 2021-04-16  0061M00001AXc5lQAD      970  ...      970            0          1
1 2021-04-17  0061M00001AXbCiQAL      159  ...      159            0          0
2 2021-04-18  0061M00001AXb9AQAT       80  ...       80            0          0
3 2021-04-19  0061M00001AXbIHQA1     1100  ...     1100            1          1
4 2021-04-20  0061M00001AXbY8QAL       90  ...       90            0          0
5 2021-04-21  0061M00001AXbQ1QAL       29  ...       29            0          0
6 2021-04-21  0061M00001AXc5lQAD       30  ...     1000            1          0
7 2021-05-02  0061M00001AXc5lQAD       50  ...     1050            0          0
8 2021-05-05  0061M00001AXc5lQAD       50  ...     1100            0          0
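
Since the real data has millions of rows, the groupby/apply helper may become slow. A possible alternative is sketched below: a row is the first crossing exactly when the cumulative revenue including that row has reached the threshold while the cumulative revenue before it had not. The helper name first_crossing is hypothetical, and the sketch assumes that revenue is never negative (no refunds) and that rows can be put in date order within each product id. The sample frame is rebuilt from the rows in the question so the block runs on its own.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-04-16", "2021-04-17", "2021-04-18", "2021-04-19", "2021-04-20",
        "2021-04-21", "2021-04-21", "2021-05-02", "2021-05-05",
    ]),
    "product id": [
        "0061M00001AXc5lQAD", "0061M00001AXbCiQAL", "0061M00001AXb9AQAT",
        "0061M00001AXbIHQA1", "0061M00001AXbY8QAL", "0061M00001AXbQ1QAL",
        "0061M00001AXc5lQAD", "0061M00001AXc5lQAD", "0061M00001AXc5lQAD",
    ],
    "revenue": [970, 159, 80, 1100, 90, 29, 30, 50, 50],
    "estimated lifetime value": [2000, 50000, 3000, 8000, 4000,
                                 30000, 2000, 2000, 2000],
})

# Stable sort keeps the original order of rows that share a date
df = df.sort_values("date", kind="stable").reset_index(drop=True)
df["cum_rev"] = df.groupby("product id")["revenue"].cumsum()


def first_crossing(cum, revenue, threshold):
    """Hypothetical helper: 1 on the row where cum first reaches threshold.

    threshold may be a scalar or a Series aligned with cum.
    Assumes revenue >= 0, so cum never drops back below the threshold.
    """
    prev_cum = cum - revenue  # cumulative revenue before the current row
    return ((cum >= threshold) & (prev_cum < threshold)).astype(int)


# Fixed threshold of 1000
df["dw1k_thresh"] = first_crossing(df["cum_rev"], df["revenue"], 1000)

# Row-wise threshold: 10% of the product's estimated lifetime value
df["dw10_perc"] = first_crossing(
    df["cum_rev"], df["revenue"], 0.10 * df["estimated lifetime value"]
)

print(df[["date", "product id", "cum_rev", "dw1k_thresh", "dw10_perc"]])
```

On the sample rows this gives the same dw1k_thresh and dw10_perc columns as the output above. To compare several thresholds, as mentioned in the question, first_crossing can be called in a loop with a different threshold each time and the result stored in a separate column.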
sophocles answered Sep 30 '22 18:09
