I am trying to get various combinations for the data in three columns, and while doing so, I also want to aggregate (sum) the values.
My data is shown below, followed by my sample output:
Dim1 Dim2 Dim3 Spend
A X Z 100
A Y Z 200
B X Z 300
B Y Z 400
Sample output:
Dim1 Dim2 Dim3 Spend
A NaN NaN 300
A X NaN 100
A Y NaN 200
A NaN Z 300
B NaN NaN 700
B X NaN 300
B Y NaN 400
B NaN Z 700
NaN X Z 400
NaN Y Z 600
NaN NaN Z 1000
NaN X NaN 400
NaN Y NaN 600
A X Z 100
A Y Z 200
B X Z 300
B Y Z 400
Dim1, Dim2, and Dim3 are categorical variables and Spend is a value/metric. We need the total Spend for every possible combination of the categorical variables. I am able to generate the combinations using itertools.combinations(), and this should work not only for three columns but for any number of such variables, e.g. Dim1, Dim2, Dim3, ..., Dim30.
My problem is that I am unable to aggregate over these combinations. For example, in row 12 (NaN NaN Z), the Spend value for category Z is the sum() of all the values where Z appears in the main data, hence the value 1000. How do we achieve this kind of aggregation?
Reproducible data:
import pandas as pd

data = pd.DataFrame({'Dim1': ['A', 'A', 'B', 'B'],
                     'Dim2': ['X', 'Y', 'X', 'Y'],
                     'Dim3': ['Z', 'Z', 'Z', 'Z'],
                     'Spend': [100, 200, 300, 400]})
You've got the right idea of using itertools.combinations(). Further key steps:
- Call itertools.combinations() for every possible number of group-by dimensions (from 1 to n_dim-1), i.e. itertools.combinations(range(1, 1+n_dim), i) for i in range(1, 1+n_dim). The full-length combination (i = n_dim) is generated as well and is handled separately.
- Use df.groupby(by=column_combinations).sum() to get the results for each combination of classes automatically.
The program below consists of 3 logical parts.
Caution: Be sure to test for performance and memory issues in production use.
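To get a feel for why the caution matters: the number of non-empty subsets of n dimensions is 2^n - 1, so the number of group-by passes doubles with each added dimension. A small sketch of that count (the helper name n_groupings is made up for illustration and is not part of the program below):

```python
import math

def n_groupings(n_dim):
    # number of non-empty subsets of n_dim dimensions:
    # sum of C(n_dim, i) for i = 1..n_dim, which equals 2**n_dim - 1
    return sum(math.comb(n_dim, i) for i in range(1, n_dim + 1))

for n in (3, 10, 30):
    print(n, n_groupings(n))
```

For 3 dimensions this gives 7 groupings, while for the 30 dimensions mentioned in the question it is already over a billion, so in practice you would probably restrict the subset sizes.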
import pandas as pd
import numpy as np
import itertools

df = pd.DataFrame(
    {'Dim1': ['A', 'A', 'B', 'B'],
     'Dim2': ['X', 'Y', 'X', 'Y'],
     'Dim3': ['Z', 'Z', 'Z', 'Z'],
     'Spend': [100, 200, 300, 400]
     }
)

# constants: column names and dimensions
n_dim = 3
dim_cols = [f"Dim{i}" for i in range(1, n_dim + 1)]
cols = dim_cols + ["Spend"]

# 1. compute sums with every dimension
def dfs(df, ls_out, dim_now=1, ls_classes=None):
    # avoid a mutable default argument
    if ls_classes is None:
        ls_classes = []
    # termination condition (every dimension has been traversed)
    if dim_now == n_dim + 1:
        # perform aggregation (named `total` so the built-in sum() is not shadowed)
        total = df["Spend"].sum()
        ls_out.append(ls_classes + [total])
        return
    # proceed
    col = f"Dim{dim_now}"
    # get categories in sorted order
    classes = sorted(df[col].unique())
    for c in classes:
        # recurse into the next dimension with the matching subset of the data
        dfs(df[df[col] == c], ls_out,
            dim_now=dim_now + 1,
            ls_classes=ls_classes + [c])

ls_out = []  # the output container
dfs(df, ls_out)
# convert to dataframe
df_every_dim = pd.DataFrame(data=ls_out, columns=cols)
del ls_out
print(df_every_dim)

# 2. generate combinations of groupby-dimensions
def multinomial_combinations(n_dim):
    for i in range(1, 1+n_dim):
        for tup in itertools.combinations(range(1, 1+n_dim), i):
            yield tup

print("Check multinomial_combinations(4):")
for i in multinomial_combinations(4):
    print(i)

# 3. sum over subsets of dimensions, based on df_every_dim
def aggr_by_dims(df, by_dims):
    # guard
    if not (0 < len(by_dims) < n_dim):
        raise ValueError(f"Wrong n_dim={n_dim}, len(by_dims)={len(by_dims)}")
    # by-columns
    by_cols = [f"Dim{i}" for i in by_dims]
    # groupby-sum of Spend only (the remaining Dim columns are not numeric)
    df_grouped = df.groupby(by=by_cols, as_index=False)["Spend"].sum()
    # create none-columns for the dimensions that were not grouped by
    for i in range(1, 1+n_dim):
        if i not in by_dims:
            df_grouped[f"Dim{i}"] = None  # or np.nan as you wish
    # reorder columns
    return df_grouped[cols]

print("\nCheck aggr_by_dims(df_every_dim, [1, 3]):")
print(aggr_by_dims(df_every_dim, [1, 3]))

# combine 2. and 3.
ls = []
for by_dims in multinomial_combinations(n_dim):
    if len(by_dims) < n_dim:
        df_grouped = aggr_by_dims(df_every_dim, by_dims)
        ls.append(df_grouped)
# the full combination (no none-dimensions)
ls.append(df_every_dim)
# final result
df_ans = pd.concat(ls, axis=0)
df_ans.reset_index(drop=True, inplace=True)
print(df_ans)

(Intermediate outputs were omitted)
Dim1 Dim2 Dim3 Spend
0 A None None 300
1 B None None 700
2 None X None 400
3 None Y None 600
4 None None Z 1000
5 A X None 100
6 A Y None 200
7 B X None 300
8 B Y None 400
9 A None Z 300
10 B None Z 700
11 None X Z 400
12 None Y Z 600
13 A X Z 100
14 A Y Z 200
15 B X Z 300
16 B Y Z 400
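
As a side note, the same table can probably be produced more compactly by grouping the original rows directly for each subset of dimension columns and letting pd.concat fill the missing columns with NaN. This is only a minimal sketch, not the program above: the helper name all_combination_sums and its parameters are made up for illustration, and it assumes the dimension columns contain no missing values (groupby drops NaN keys by default).

```python
import itertools
import pandas as pd

def all_combination_sums(data, dim_cols, value_col="Spend"):
    """Sum value_col for every non-empty subset of dim_cols (illustrative helper)."""
    frames = []
    for r in range(1, len(dim_cols) + 1):
        for by_cols in itertools.combinations(dim_cols, r):
            # group the original rows directly by this subset of columns
            grouped = data.groupby(list(by_cols), as_index=False)[value_col].sum()
            frames.append(grouped)
    # concat aligns on column names; columns absent from a frame become NaN
    out = pd.concat(frames, ignore_index=True, sort=False)
    # restore the original column order
    return out[list(dim_cols) + [value_col]]

data = pd.DataFrame({'Dim1': ['A', 'A', 'B', 'B'],
                     'Dim2': ['X', 'Y', 'X', 'Y'],
                     'Dim3': ['Z', 'Z', 'Z', 'Z'],
                     'Spend': [100, 200, 300, 400]})
result = all_combination_sums(data, ['Dim1', 'Dim2', 'Dim3'])
print(result)
```

The row order follows the order of the combinations, as in the output above, but the unused dimensions show up as NaN rather than None. The performance caution applies here as well, since one groupby is run per combination.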