
Create column based on percentage of recurring customers

I have a DataFrame of order data in which each row is a separate order. It has the following columns:

  • date_created
  • customer_id
  • total
  • recurring_customer

A customer counts as a recurring customer once they have placed their third order. I want to find out what percentage of the total value recurring customers contribute.

The DataFrame looks like this:

import pandas as pd

df = pd.DataFrame(
    {
        "date_created": ["2019-11-16", "2019-11-16", "2019-11-16", "2019-11-16", "2019-11-16", "2019-11-16"],
        "customer_id": ["1733", "6356", "6457", "6599", "6637", "6638"],
        "total": ["746.02", "1236.60", "1002.32", "1187.21", "1745.03", "2313.14"],
        "recurring_customer": ["False", "False", "False", "False", "False", "False"],
    }
)
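For context, a flag like recurring_customer could be derived from the order history roughly as follows. This is only a minimal sketch: it assumes the flag turns True on a customer's third order and stays True afterwards, and the `orders` frame and its values are hypothetical, not the real data.

```python
import pandas as pd

# Hypothetical order log: customer "A" orders three times, "B" once.
orders = pd.DataFrame(
    {
        "date_created": pd.to_datetime(
            ["2019-11-01", "2019-11-05", "2019-11-09", "2019-11-10"]
        ),
        "customer_id": ["A", "B", "A", "A"],
        "total": [10.0, 20.0, 30.0, 40.0],
    }
).sort_values("date_created")

# cumcount() numbers each customer's orders 0, 1, 2, ... in date order,
# so a value of 2 or more marks the third order onwards (assumed definition).
order_number = orders.groupby("customer_id").cumcount()
orders["recurring_customer"] = order_number >= 2

print(orders)
```

Only the last order of customer "A" is flagged here, because it is that customer's third order.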

By resampling the data to monthly data:

df_monthly = df.resample('1M').mean()

I got the following output:

df_monthly = pd.DataFrame(
    {
        "date_created": ["2019-11-30", "2019-12-31", "2020-01-31", "2020-02-29", "2020-03-31", "2020-04-30"],
        "customer_id": ["4987.02", "5291.56", "5702.13", "6439.27", "7263.11", "8080.91"],
        "total": ["2915.25", "2550.85", "2486.72", "2515.81", "2633.77", "2558.19"],
        "recurring_customer": ["0.009050", "0.016667", "0.075630", "0.138122", "0.130045", "0.175503"],
    }
)

So the real question is: what percentage of each month's total value comes from recurring customers?

The desired output should look something like this:

| date_created | customer_id | total   | recurring_customer | recurring_customer_total | recurring_customer_total_percentage |
| ------------ | ----------- | ------- | ------------------ | ------------------------ | ----------------------------------- |
| 2019-11-30   | 4987.02     | 2915.25 | 0.009050           | ??????                   | ??????                              |
| 2019-12-31   | 5291.56     | 2550.85 | 0.016667           | ??????                   | ??????                              |
| 2020-01-31   | 5702.13     | 2486.72 | 0.075630           | ??????                   | ??????                              |
| 2020-02-29   | 6439.27     | 2515.81 | 0.138122           | ??????                   | ??????                              |
| 2020-03-31   | 7263.11     | 2633.77 | 0.130045           | ??????                   | ??????                              |
| 2020-04-30   | 8080.91     | 2558.19 | 0.175503           | ??????                   | ??????                              |

Note that I can't simply multiply the recurring_customer fraction by the total value, because I assume recurring customers contribute far more to the total value per order than non-recurring customers do.
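To make the difference concrete, here is a small sketch with made-up numbers (the `orders` frame is hypothetical). The mean of the boolean flag is the share of orders placed by recurring customers, which can differ a lot from their share of the value.

```python
import pandas as pd

# Hypothetical month with four orders; the one recurring order is large.
orders = pd.DataFrame(
    {
        "total": [100.0, 100.0, 100.0, 700.0],
        "recurring_customer": [False, False, False, True],
    }
)

# Share of orders placed by recurring customers (what .mean() of the flag gives).
share_of_orders = orders["recurring_customer"].mean()

# Share of value contributed by recurring customers.
recurring_value = orders.loc[orders["recurring_customer"], "total"].sum()
share_of_value = recurring_value / orders["total"].sum()

print(share_of_orders)  # 0.25
print(share_of_value)   # 0.7
```

One order in four is recurring, yet it accounts for 70% of the value, so scaling the total by 0.25 would understate the contribution.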

I tried the np.where() function on the daily dataframe, with the following plan:

  • I would create a column 'recurring_customer_total' in the daily dataframe that copies the value of the 'total' column when 'recurring_customer' is True and is 0 otherwise. I found a similar question here: get values from first column when other columns are true using a lookup list. Another similar question was asked here: Getting indices of True values in a boolean list. That answer returns all 'True' values and their positions, whereas I want the value of 'total' copied into 'recurring_customer_total' when 'recurring_customer' is True.
  • Then I would resample the daily dataframe to a monthly dataframe, which would give me the mean amount that recurring customers contributed to the total value. Those values would appear in 'recurring_customer_total'.
  • The final step would be to calculate 'recurring_customer_total' as a percentage of the 'total' column. Those values should be stored in 'recurring_customer_total_percentage'.
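The three steps above can be sketched as follows. This is a minimal illustration on hypothetical data, not the real order table. It groups by calendar month with `dt.to_period("M")` instead of `resample('1M')`, and it sums rather than averages. Both are choices made for this sketch. Because both columns have the same number of rows per month, the percentage comes out the same with sums or means.

```python
import numpy as np
import pandas as pd

# Hypothetical daily orders spanning two months.
df = pd.DataFrame(
    {
        "date_created": pd.to_datetime(
            ["2019-11-16", "2019-11-25", "2019-12-02", "2019-12-09"]
        ),
        "total": [100.0, 300.0, 50.0, 150.0],
        "recurring_customer": [False, True, True, False],
    }
)

# Step 1: copy 'total' where the flag is True, otherwise 0.
df["recurring_customer_total"] = np.where(
    df["recurring_customer"], df["total"], 0.0
)

# Step 2: aggregate both value columns per calendar month.
month = df["date_created"].dt.to_period("M")
monthly = df.groupby(month)[["total", "recurring_customer_total"]].sum()

# Step 3: recurring customers' share of each month's total, in percent.
monthly["recurring_customer_total_percentage"] = (
    monthly["recurring_customer_total"] / monthly["total"] * 100
)

print(monthly)
```

With these made-up numbers, November comes out at 75% and December at 25%.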

I think those are the steps I need to follow. The only problem is that I don't really know how to get there.

Thanks in advance!

Jordy Slinkman asked Nov 05 '22 23:11

1 Answer

I'm fairly new to Python, but I've managed to answer my own question. I can't say this is the best, easiest, or fastest way, but it worked for me.

First, I made a new dataframe containing only the rows of the original dataframe where 'recurring_customer' is True, using the following code:

df_recurring_customers = df.loc[df['recurring_customer'] == True]

It gave me the following dataframe:

# First rows of df_recurring_customers, i.e. df_recurring_customers.head()
pd.DataFrame(
    {
        "date_created": ["2019-11-25", "2019-11-28", "2019-12-02", "2019-12-09", "2019-12-11"],
        "customer_id": ["577", "6457", "577", "6647", "840"],
        "total": ["33891.12", "81.98", "9937.68", "1166.28", "2969.60"],
        "recurring_customer": ["True", "True", "True", "True", "True"],
    }
)

Then I resampled the values using:

df_recurring_customers_monthly_sum = df_recurring_customers.resample('1M').sum()

I then dropped the 'number' and 'customer_id' columns, which carry no meaningful value once summed. The next step was to join the two dataframes 'df_monthly' and 'df_recurring_customers_monthly_sum' using:

df_total = df_recurring_customers_monthly_sum.join(df_monthly)
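One detail the join line does not show: both frames carry a 'total' column, and DataFrame.join raises an error on overlapping column names unless suffixes are given. A possible way around that, shown here as an assumption rather than what was actually run, is to rename the recurring sum before joining. The frame names and numbers below are hypothetical.

```python
import pandas as pd

# Hypothetical monthly sums, indexed by month end.
idx = pd.to_datetime(["2019-11-30", "2019-12-31"])
df_monthly_sum = pd.DataFrame({"total": [1000.0, 2000.0]}, index=idx)
df_recurring_monthly_sum = pd.DataFrame({"total": [100.0, 500.0]}, index=idx)

# Rename before joining so the two 'total' columns do not collide.
df_total = df_monthly_sum.join(
    df_recurring_monthly_sum.rename(
        columns={"total": "recurring_customer_total"}
    )
)

print(df_total)
```

Joining from the all-orders frame (a left join by default) also keeps months that had no recurring orders. Those rows would hold NaN in 'recurring_customer_total' and could be filled with 0 if needed.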

This gave me:

| date_created | total      | recurring_customer_total |
| ------------ | ---------- | ------------------------ |
|  2019-11-30  | 644272.02  |         33973.10         |
|  2019-12-31  | 612205.99  |         15775.29         |
|  2020-01-31  | 887761.60  |         61612.27         |
|  2020-02-29  | 910724.75  |         125315.31        |
|  2020-03-31  | 1174662.59 |         125315.31        |
|  2020-04-30  | 1399332.26 |         248277.97        |

Then I calculated the percentage:

df_total['recurring_customer_total_percentage'] = (df_total['recurring_customer_total'] / df_total['total']) * 100

Which gave me:

| date_created | total      | recurring_customer_total | recurring_customer_total_percentage |
| ------------ | ---------- | ------------------------ | ----------------------------------- |
| 2019-11-30   | 644272.02  | 33973.10                 | 5.273099                            |
| 2019-12-31   | 612205.99  | 15775.29                 | 2.576794                            |
| 2020-01-31   | 887761.60  | 61612.27                 | 6.940182                            |
| 2020-02-29   | 910724.75  | 125315.31                | 13.759954                           |
| 2020-03-31   | 1174662.59 | 125315.31                | 13.967221                           |
| 2020-04-30   | 1399332.26 | 248277.97                | 17.742603                           |
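As a quick check of the formula, the first two rows of the table can be recomputed from their 'total' and 'recurring_customer_total' values. This is just a sketch that rebuilds those two rows by hand. The month-end index is typed in rather than produced by a resample.

```python
import pandas as pd

# First two monthly rows, copied from the joined table in this answer.
df_total = pd.DataFrame(
    {
        "total": [644272.02, 612205.99],
        "recurring_customer_total": [33973.10, 15775.29],
    },
    index=pd.to_datetime(["2019-11-30", "2019-12-31"]),
)

# Recurring customers' share of the monthly total, in percent.
df_total["recurring_customer_total_percentage"] = (
    df_total["recurring_customer_total"] / df_total["total"] * 100
)

print(df_total["recurring_customer_total_percentage"].round(4).tolist())
```

The results should land at roughly 5.27 and 2.58 percent, in line with the corresponding rows of the table.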
Jordy Slinkman answered Nov 14 '22 23:11