I am new to python. Currently, I have a dataset that contains more than 40 columns needed to be grouped and aggregated. I was wondering if python has a function similar to cube() function in SQL. If not, how can I have the expected output? Really appreciate it if someone could answer for me. Thanks!
Below Example that I simplified to 2 columns only (Country & Attribute A):
| CustomerID | Country |Attribute A|Sales| No.of product| No. of transaction|
| ---------- | --------|-----------|-----|--------------|-------------------|
| 1 | US |A |20 |2 |2 |
| 2 | US |B |25 |3 |3 |
|3 |CA |A |100 |10 |10 |
|4 |CA |B |50 |5 |5 |
|5 |UK |A |40 |4 |4 |
Expected Output:
| Country|Attribute A|Sum of Sales|Total no. of product| Total no. of transaction| Total no. of customer|
|--------|-----------|------------|--------------------|-------------------------|----------------------|
|US |(null) |45 |5 |5 |2 |
|CA |(null) |150 |15 |155 |2 |
|UK |(null) |40 |4 |4 |1 |
|(null) |A |160 |16 |16 |3 |
|(null) |B |75 |8 |8 |2 |
|US |A |20 |2 |2 |1 |
|US |B |25 |3 |3 |1 |
|CA |A |100 |10 |10 |1 |
|CA |B |50 |5 |5 |1 |
|UK |A |40 |4 |4 |1 |
Here's an extension of the other two answers that generalizes to arbitrarily many columns.
import pandas as pd
from itertools import combinations
def cube_sum(df,cols):
dfs = []
for n in range(len(cols),0,-1):
for subset in combinations(cols,n):
dfs.append(df.groupby(list(subset)).sum().reset_index())
dfs.append(df.drop(cols, axis = 1).sum().to_frame().T)
return pd.concat(dfs)
df = pd.DataFrame(
{'CustomerID':[1,2,3,4,5],
'Country':['US','US','CA','CA','UK'],
'Attribute A':list("ABABA"),
'Sales':[20,50,100,50,40],
'No. of Product':[2,3,10,5,4],
'No. of Transaction':[2,3,10,5,4]}).set_index('CustomerID')
all_groups = cube_sum(df,["Attribute A","Country"])
The result:
Attribute A Country Sales No. of Product No. of Transaction
0 A CA 100 10 10
1 A UK 40 4 4
2 A US 20 2 2
3 B CA 50 5 5
4 B US 50 3 3
0 A NaN 160 16 16
1 B NaN 100 8 8
0 NaN CA 150 15 15
1 NaN UK 40 4 4
2 NaN US 70 5 5
0 NaN NaN 260 24 24
In response to the comment:
import pandas as pd
from itertools import combinations
def cube_agg(df,cols):
dfs = []
for n in range(len(cols),0,-1):
for subset in combinations(cols,n):
dfs.append(df.groupby(list(subset))
.aggregate({'Population': 'mean', 'Area': ['min', 'max']})
.reset_index())
dfs.append((df.drop(["Attribute A","Country"], axis = 1)
.aggregate({'Population': 'mean', 'Area': ['min', 'max']})
.T.stack()
.to_frame().T))
return pd.concat(dfs)
df = pd.DataFrame(
{'CustomerID':[1,2,3,4,5],
'Country':['US','US','CA','CA','UK'],
'Attribute A':list("ABABA"),
'Population':[20,50,100,50,40],
'Area':[2,3,10,5,4]}).set_index('CustomerID')
all_groups = cube_agg(df,["Attribute A","Country"])
Result:
Area Attribute A Country Population
max min mean
0 10.0 10.0 A CA 100.000000
1 4.0 4.0 A UK 40.000000
2 2.0 2.0 A US 20.000000
3 5.0 5.0 B CA 50.000000
4 3.0 3.0 B US 50.000000
0 10.0 2.0 A NaN 53.333333
1 5.0 3.0 B NaN 50.000000
0 10.0 5.0 NaN CA 75.000000
1 4.0 4.0 NaN UK 40.000000
2 3.0 2.0 NaN US 35.000000
0 10.0 2.0 NaN NaN 52.000000
The trickiest aspect of this is getting the bottom row (overall aggregate) to work as expected. Note that the columns seem to be arranged in alphabetical order, hence the area coming first.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With