How to get various combinations of categories in a categorical variable and at the same time aggregate it?

I am trying to get various combinations for the data in three columns, and while doing so, I also want to aggregate (sum) the values.

My data is shown below, followed by my sample output:

Dim1    Dim2    Dim3    Spend
A       X       Z       100
A       Y       Z       200
B       X       Z       300
B       Y       Z       400

Sample output:

Dim1    Dim2    Dim3    Spend
A       NaN     NaN     300
A       X       NaN     100
A       Y       NaN     200
A       NaN     Z       300
B       NaN     NaN     700
B       X       NaN     300
B       Y       NaN     400
B       NaN     Z       700
NaN     X       Z       400
NaN     Y       Z       600
NaN     NaN     Z       1000
NaN     X       NaN     400
NaN     Y       NaN     600
A       X       Z       100
A       Y       Z       200
B       X       Z       300
B       Y       Z       400

Dim1, Dim2 and Dim3 are categorical variables and Spend is a value/metric. We need the total of Spend for every possible combination of the categorical variables. I am able to generate the column combinations using itertools.combinations(), and the same approach should work for any number of such variables (Dim1, Dim2, Dim3, ..., Dim30 and so on).

My problem is that I am unable to aggregate over those combinations. For example, in row 12 of the sample output (NaN, NaN, Z), the Spend value for category Z is the sum() of all values where Z appears in the main data, hence 1000. How do we compute this kind of aggregate for every combination?


Reproducible data:

import pandas as pd

data = pd.DataFrame({'Dim1': ['A', 'A', 'B', 'B'],
 'Dim2': ['X', 'Y', 'X', 'Y'],
 'Dim3': ['Z', 'Z', 'Z', 'Z'],
 'Spend': [100, 200, 300, 400]})

sunitprasad1 asked Oct 31 '25 09:10


1 Answer

Digest

You have the right idea in using itertools.combinations(). The remaining key steps:

  1. Apply itertools.combinations() for every possible number of dimensions to group by (from 1 to n_dim), i.e. itertools.combinations(range(1, 1+n_dim), i) for i in range(1, 1+n_dim). The single combination of size n_dim is the fully detailed grouping, which the code below handles separately.
  2. Use df.groupby(by=column_combinations).sum() to compute the sum for every combination of classes within the chosen columns; the columns that were not grouped by are then filled with None.
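As a compact illustration of the two steps above, here is a minimal sketch that groups directly on the raw data, assuming the column names from the question. The names `frames` and `result` are illustrative only, and this is not the DFS-based program shown further down.

```python
import itertools

import pandas as pd

data = pd.DataFrame({
    "Dim1": ["A", "A", "B", "B"],
    "Dim2": ["X", "Y", "X", "Y"],
    "Dim3": ["Z", "Z", "Z", "Z"],
    "Spend": [100, 200, 300, 400],
})
dim_cols = ["Dim1", "Dim2", "Dim3"]

frames = []
# step 1: every subset size from 1 to the number of dimensions
for r in range(1, len(dim_cols) + 1):
    for combo in itertools.combinations(dim_cols, r):
        # step 2: sum Spend within each combination of classes
        grouped = data.groupby(list(combo), as_index=False)["Spend"].sum()
        frames.append(grouped)

# concat aligns on column names, so dimensions that were not
# grouped by end up as NaN in the combined table
result = pd.concat(frames, ignore_index=True)[dim_cols + ["Spend"]]
print(result)
```

Here the missing dimensions come out as NaN rather than None, which matches the sample output in the question.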

Code

The program consists of 3 logical parts.

  1. Aggregate over every combination of classes across all dimensions. This is essentially what you have already done, but it is implemented as a depth-first search (DFS) so that each recursion step only works on the matching subset of rows. This can be useful when there are millions of rows to process. The later steps are computed from this intermediate dataset instead of the raw dataset.
  2. A generator that loops through the dimension combinations mentioned in Digest 1 without enumerating them explicitly.
  3. Perform the group-by computations mentioned in Digest 2 and collect a list of result dataframes, which are concatenated at the end of the program.
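To make part 2 concrete, the following sketch shows one possible way to write such a generator with itertools.chain and what it yields for three dimensions. The function name `dimension_subsets` is hypothetical and is not part of the program below.

```python
import itertools


def dimension_subsets(n_dim):
    """Lazily yield every non-empty subset of dimension indices 1..n_dim."""
    indices = range(1, n_dim + 1)
    return itertools.chain.from_iterable(
        itertools.combinations(indices, size) for size in range(1, n_dim + 1)
    )


subsets = list(dimension_subsets(3))
print(subsets)
# → [(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
```

The last tuple covers all dimensions, which corresponds to the fully detailed table produced in part 1.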

Caution: Be sure to test for performance and memory issues in production use.
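To get a feel for why this matters, the number of group-by operations grows as 2**n_dim - 1, since every non-empty subset of dimensions is one grouping. A small sketch (the helper name `count_groupings` is illustrative only):

```python
import math


def count_groupings(n_dim):
    """Number of non-empty subsets of n_dim dimensions."""
    return sum(math.comb(n_dim, k) for k in range(1, n_dim + 1))


for n in (3, 10, 30):
    print(n, count_groupings(n))
# → 3 7
# → 10 1023
# → 30 1073741823
```

With 30 dimensions, as mentioned in the question, that is over a billion groupings, so restricting the subset sizes may be necessary in practice.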

import pandas as pd
import numpy as np
import itertools

df = pd.DataFrame(
    {'Dim1': ['A', 'A', 'B', 'B'],
     'Dim2': ['X', 'Y', 'X', 'Y'],
     'Dim3': ['Z', 'Z', 'Z', 'Z'],
     'Spend': [100, 200, 300, 400]
     }
)

# constants: column names and dimensions
n_dim = 3
dim_cols = [f"Dim{i}" for i in range(1, n_dim + 1)]
cols = dim_cols + ["Spend"]


# 1. compute sums with every dimension
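def dfs(df, ls_out, dim_now=1, ls_classes=None):
    # avoid a shared mutable default argument
    if ls_classes is None:
        ls_classes = []

    # termination condition (every dimension has been traversed)
    if dim_now == n_dim + 1:
        # perform aggregation and store the classes plus their total as one row
        total = df["Spend"].sum()  # named `total` to avoid shadowing built-in sum()
        ls_out.append(ls_classes + [total])
        return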

    # proceed
    col = f"Dim{dim_now}"

    # get categories (sorted for a deterministic output order)
    classes = sorted(df[col].unique())

    for c in classes:
        # recurse next dimension with subset data
        dfs(df[df[col] == c], ls_out,
            dim_now=dim_now + 1,
            ls_classes=ls_classes + [c])

ls_out = []  # the output container
dfs(df, ls_out)
# convert to dataframe (each row holds the classes followed by the total, i.e. cols)
df_every_dim = pd.DataFrame(data=ls_out, columns=cols)
del ls_out
print(df_every_dim)


# 2. generate combinations of groupby-dimensions
def multinomial_combinations(n_dim):
    for i in range(1, 1+n_dim):
        for tup in itertools.combinations(range(1, 1+n_dim), i):
            yield tup

print("Check multinomial_combinations(4):")
for i in multinomial_combinations(4):
    print(i)

# 3. Sum over subsets of dimensions, based on df_every_dim
def aggr_by_dims(df, by_dims):

    # guard
    if not (0 < len(by_dims) < n_dim):
        raise ValueError(f"Wrong n_dim={n_dim}, len(by_dims)={len(by_dims)}")

    # by-columns
    by_cols = [f"Dim{i}" for i in by_dims]

    # groupby-sum (select Spend so the other Dim columns are not aggregated)
    df_grouped = df.groupby(by=by_cols)["Spend"].sum().reset_index()

    # fill the dimensions that were not grouped by with None
    for col in dim_cols:
        if col not in by_cols:
            df_grouped[col] = None  # or np.nan as you wish

    # reorder columns
    return df_grouped[cols]

print("\nCheck aggr_by_dims(df_every_dim, [1, 3]):")
print(aggr_by_dims(df_every_dim, [1, 3]))

# combine 2. and 3.
ls = []
for by_dims in multinomial_combinations(n_dim):
    if len(by_dims) < n_dim:
        df_grouped = aggr_by_dims(df_every_dim, by_dims)
        ls.append(df_grouped)

# no none-dimensions
ls.append(df_every_dim)

# final result
df_ans = pd.concat(ls, axis=0)
df_ans.reset_index(drop=True, inplace=True)
print(df_ans)

Output

(Intermediate outputs were omitted)

    Dim1  Dim2  Dim3  Spend
0      A  None  None    300
1      B  None  None    700
2   None     X  None    400
3   None     Y  None    600
4   None  None     Z   1000
5      A     X  None    100
6      A     Y  None    200
7      B     X  None    300
8      B     Y  None    400
9      A  None     Z    300
10     B  None     Z    700
11  None     X     Z    400
12  None     Y     Z    600
13     A     X     Z    100
14     A     Y     Z    200
15     B     X     Z    300
16     B     Y     Z    400

Bill Huang answered Nov 04 '25 03:11
