TL;DR: How can we achieve something similar to SQL's GROUP BY ROLLUP with any kind of aggregate in pandas? (Credit to @Scott Boston for this term.)
I have following dataframe:
P Q R S T
0 PLAC NR F HOL F
1 PLAC NR F NHOL F
2 TRTB NR M NHOL M
3 PLAC NR M NHOL M
4 PLAC NR F NHOL F
5 PLAC R M NHOL M
6 TRTA R F HOL F
7 TRTA NR F HOL F
8 TRTB NR F NHOL F
9 PLAC NR F NHOL F
10 TRTB NR F NHOL F
11 TRTB NR M NHOL M
12 TRTA NR F HOL F
13 PLAC NR F HOL F
14 PLAC R F NHOL F
For the list of columns ['Q', 'R', 'S', 'T'], I want to calculate some aggregates on the P column for the following 4 lists of grouping columns:
['Q']
['Q', 'R']
['Q', 'R', 'S']
['Q', 'R', 'S', 'T']
I've already written the code to group the above dataframe on an increasing number of columns, calculate the aggregate (using count for the sake of simplicity) on each groupby object, and finally concatenate the results:
cols = list('QRST')
aggCol = 'P'
groupCols = []
result = []
for col in cols:
    groupCols.append(col)
    result.append(df.groupby(groupCols)[aggCol].agg(count='count').reset_index())
result = pd.concat(result)[groupCols+['count']]
However, I have a strong feeling that the above method is not very efficient in terms of CPU time. Is there a more efficient way to apply an aggregate over such a continuously increasing number of grouping columns?
Why I think it is not so efficient: for the above values, in the first iteration it groups the dataframe on the Q column and calculates the aggregate. In the next iteration it groups the dataframe on Q and R, which means it again needs to group by Q and then R, even though the data was already grouped by Q in the first iteration, so the same operation is repeated. If there were some way to utilize the previously created groups, I think it would be more efficient.
OUTPUT:
Q R S T count
0 NR NaN NaN NaN 12
1 R NaN NaN NaN 3
0 NR F NaN NaN 9
1 NR M NaN NaN 3
2 R F NaN NaN 2
3 R M NaN NaN 1
0 NR F HOL NaN 4
1 NR F NHOL NaN 5
2 NR M NHOL NaN 3
3 R F HOL NaN 1
4 R F NHOL NaN 1
5 R M NHOL NaN 1
0 NR F HOL F 4
1 NR F NHOL F 5
2 NR M NHOL M 3
3 R F HOL F 1
4 R F NHOL F 1
5 R M NHOL M 1
I already looked into Is there an equivalent of SQL GROUP BY ROLLUP in Python pandas? and Pandas Pivot tables row subtotals, but they don't work in my case. I already tried them: those methods can only be used to get the count, and they immediately fail even for unique counts when the same identifier appears for more than one value:
pd.pivot_table(df, aggCol, columns=cols, aggfunc='count', margins=True).T.reset_index()
Q R S T P
0 NR F HOL F 4
1 NR F NHOL F 5
2 NR M NHOL M 3
3 NR All 3
4 R F HOL F 1
5 R F NHOL F 1
6 R M NHOL M 1
7 R All 3
In order to avoid any unnecessary confusion with just getting the count, and as per the suggestion in the comments, I have also added an example using the mean as the aggregate, changing the P column to a numeric type:
P Q R S T
0 9 NR F HOL F
1 7 NR F NHOL F
2 3 NR M NHOL M
3 9 NR M NHOL M
4 1 NR F NHOL F
5 0 R M NHOL M
6 1 R F HOL F
7 7 NR F HOL F
8 2 NR F NHOL F
9 2 NR F NHOL F
10 1 NR F NHOL F
11 2 NR M NHOL M
12 3 NR F HOL F
13 6 NR F HOL F
14 0 R F NHOL F
cols = list('QRST')
aggCol = 'P'
groupCols = []
result = []
for col in cols:
    groupCols.append(col)
    result.append(df.groupby(groupCols)[aggCol]
                    .agg(agg=np.mean)
                    .round(2).reset_index())
result = pd.concat(result)[groupCols+['agg']]
>>> result
Q R S T agg
0 NR NaN NaN NaN 4.33
1 R NaN NaN NaN 0.33
0 NR F NaN NaN 4.22
1 NR M NaN NaN 4.67
2 R F NaN NaN 0.50
3 R M NaN NaN 0.00
0 NR F HOL NaN 6.25
1 NR F NHOL NaN 2.60
2 NR M NHOL NaN 4.67
3 R F HOL NaN 1.00
4 R F NHOL NaN 0.00
5 R M NHOL NaN 0.00
0 NR F HOL F 6.25
1 NR F NHOL F 2.60
2 NR M NHOL M 4.67
3 R F HOL F 1.00
4 R F NHOL F 0.00
5 R M NHOL M 0.00
Building on the idea of @ScottBoston (progressive aggregation, i.e., repeatedly aggregating on the previous aggregate result), we can do something that is relatively generic with regard to the aggregation function, if that function can be expressed as a composition of functions ((f3 ∘ f2 ∘ f2 ∘ ... ∘ f1)(x), or in other words: f3(f2(f2(...(f1(x)))))).
For example, sum works fine as is, because sum is associative, so the sum of group sums is the sum of the whole. For count, the initial function (f1) is indeed count, but f2 has to be sum, and the final f3 has to be identity. For mean, the initial function (f1) has to produce two quantities: sum and count. The intermediary function f2 can be sum, and the final function (f3) has then to be the ratio of the two quantities.
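To make that composition concrete, here is a minimal sketch of mine (toy data, not from the question) showing that a mean can be re-aggregated correctly from per-group sums and counts, but only if the ratio is taken at the very end:

import numpy as np
import pandas as pd

# toy data, just to illustrate the three steps for 'mean'
toy = pd.DataFrame({'Q': ['NR', 'NR', 'R'], 'R': ['F', 'M', 'F'], 'P': [1.0, 3.0, 5.0]})

# f1: at the finest level, keep both quantities needed later
fine = toy.groupby(['Q', 'R'])['P'].agg(['sum', 'count'])

# f2: re-aggregate to a coarser level by summing both quantities
coarse = fine.groupby(level='Q').sum()

# f3: only at the very end, take the ratio to obtain the mean
assert np.allclose(coarse['sum'] / coarse['count'], toy.groupby('Q')['P'].mean())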
Here is a rough template, with a few functions defined. As an added bonus, the function also optionally produces a grand total:
# think map-reduce: first map, then reduce (arbitrary number of times), then map to result
myfuncs = {
    'sum': [sum, sum],
    'prod': ['prod', 'prod'],
    'count': ['count', sum],
    'set': [set, lambda g: set.union(*g)],
    'list': [list, sum],
    'mean': [[sum, 'count'], sum, lambda r: r[0]/r[1]],
    'var': [
        [lambda x: (x**2).sum(), sum, 'count'],
        sum,
        lambda r: (r[0].sum() - r[1].sum()**2 / r[2]) / (r[2] - 1)],
    'std': [
        [lambda x: (x**2).sum(), sum, 'count'],
        sum,
        lambda r: np.sqrt((r[0].sum() - r[1].sum()**2 / r[2]) / (r[2] - 1))],
}

totalCol = '__total__'

def agg(df, cols, aggCol, fun, total=True):
    if total:
        cols = [totalCol] + cols
        df = df.assign(__total__=0)
    funs = myfuncs[fun]
    b = df.groupby(cols).agg({aggCol: funs[0]})
    frames = [b.reset_index()]
    for k in range(1, len(cols)):
        b = b.groupby(cols[:-k]).agg(funs[1])
        frames.append(b.reset_index())
    result = pd.concat(frames).reset_index(drop=True)
    result = result[frames[0].columns]
    if len(funs) > 2:
        s = result[aggCol].apply(funs[2], axis=1)
        result = result.drop(aggCol, axis=1, level=0)
        result[aggCol] = s
        result.columns = result.columns.droplevel(-1)
    if total:
        result = result.drop(columns=[totalCol])
    return result
cols = list('QRST')
aggCol = 'P'
>>> agg(df, cols, aggCol, 'count')
Q R S T P
0 NR F HOL F 4
1 NR F NHOL F 5
2 NR M NHOL M 3
3 R F HOL F 1
.. ... ... ... ... ..
15 R M NaN NaN 1
16 NR NaN NaN NaN 12
17 R NaN NaN NaN 3
18 NaN NaN NaN NaN 15
>>> agg(df, cols, aggCol, 'mean')
Q R S T P
0 NR F HOL F 6.250000
1 NR F NHOL F 2.600000
2 NR M NHOL M 4.666667
3 R F HOL F 1.000000
.. ... ... ... ... ...
15 R M NaN NaN 0.000000
16 NR NaN NaN NaN 4.333333
17 R NaN NaN NaN 0.333333
18 NaN NaN NaN NaN 3.533333
>>> agg(df, cols, aggCol, 'sum')
Q R S T P
0 NR F HOL F 25
1 NR F NHOL F 13
2 NR M NHOL M 14
3 R F HOL F 1
.. ... ... ... ... ..
15 R M NaN NaN 0
16 NR NaN NaN NaN 52
17 R NaN NaN NaN 1
18 NaN NaN NaN NaN 53
>>> agg(df, cols, aggCol, 'set')
Q R S T P
0 NR F HOL F {9, 3, 6, 7}
1 NR F NHOL F {1, 2, 7}
2 NR M NHOL M {9, 2, 3}
3 R F HOL F {1}
.. ... ... ... ... ...
15 R M NaN NaN {0}
16 NR NaN NaN NaN {1, 2, 3, 6, 7, 9}
17 R NaN NaN NaN {0, 1}
18 NaN NaN NaN NaN {0, 1, 2, 3, 6, 7, 9}
>>> agg(df, cols, aggCol, 'std')
Q R S T P
0 NR F HOL F 2.500000
1 NR F NHOL F 2.509980
2 NR M NHOL M 3.785939
3 R F HOL F NaN
.. ... ... ... ... ...
15 R M NaN NaN NaN
16 NR NaN NaN NaN 3.055050
17 R NaN NaN NaN 0.577350
18 NaN NaN NaN NaN 3.181793
The code is not as 'pure' as I would like it to be. There are two reasons for that. First, groupby likes to do some magic to the shape of the result: in some cases (but not always, strangely enough), if there is only one resulting group, the output is squeezed to a Series. Second, pandas arithmetic on set objects seems sometimes bogus, or finicky at best; my initial definition had 'set': [set, sum], and this was working reasonably well (pandas sometimes seems to understand that for .agg(sum) on a Series of set objects it is desirable to apply set.union), except that, weirdly enough, in some conditions we'd get a NaN result instead.
Other caveats: this only works for a single aggCol, and the expressions for std and var are relatively naive. For improved numerical stability, see Standard Deviation: Rapid calculation methods.
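For reference, a minimal sketch of mine (not part of the answer above) of the numerically stabler "combine" step described in that article (Chan et al.'s parallel formula): each group carries (count, mean, M2), where M2 is the sum of squared deviations from the group mean, and two groups are merged like this:

import numpy as np

def combine_moments(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    # merge the running moments of two disjoint partitions
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta**2 * n_a * n_b / n
    return n, mean, m2

# quick check on arbitrary numbers
x = np.array([9.0, 7.0, 3.0, 9.0, 1.0])
a, b = x[:2], x[2:]
n, mean, m2 = combine_moments(len(a), a.mean(), ((a - a.mean())**2).sum(),
                              len(b), b.mean(), ((b - b.mean())**2).sum())
assert np.isclose(m2 / (n - 1), x.var(ddof=1))  # sample variance matches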
Since the original posting of this answer, another solution has been proposed by @U12-Forward. After a bit of cleaning (e.g. not using recursion, and changing the agg dtype to whatever it needs to be, instead of object), this solution becomes:
def v_u12(df, cols, aggCol, fun):
    newdf = pd.DataFrame(columns=cols)
    for count in range(1, len(cols)+1):
        groupcols = cols[:count]
        newdf = newdf.append(
            df.groupby(groupcols)[aggCol].agg(fun).reset_index().reindex(columns=groupcols + [aggCol]),
            ignore_index=True,
        )
    return newdf
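A side note of mine (not from either answer): DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea can be written by collecting the frames and concatenating once, for example:

def v_u12_concat(df, cols, aggCol, fun):
    # one frame per prefix of the grouping columns, concatenated at the end
    frames = [
        df.groupby(cols[:count])[aggCol].agg(fun).reset_index()
        for count in range(1, len(cols) + 1)
    ]
    return pd.concat(frames, ignore_index=True).reindex(columns=cols + [aggCol])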
To compare speed, let's generate DataFrames of arbitrary sizes:
def gen_example(n, m=4, seed=-1):
    if seed >= 0:
        np.random.seed(seed)
    aggCol = 'v'
    cols = list(ascii_uppercase)[:m]
    choices = [['R', 'NR'], ['F', 'M'], ['HOL', 'NHOL']]
    df = pd.DataFrame({
        aggCol: np.random.uniform(size=n),
        **{
            k: np.random.choice(choices[np.random.randint(0, len(choices))], n)
            for k in cols
        }})
    return df
# example
>>> gen_example(8, 5, 0)
v A B C D E
0 0.548814 R M F M NR
1 0.715189 R M M M R
2 0.602763 NR M M F R
3 0.544883 R F F M NR
4 0.423655 NR M F F NR
5 0.645894 NR F M M R
6 0.437587 R M F M NR
7 0.891773 R F M M R
We can now compare speed over a range of sizes, using the excellent perfplot package, plus a few definitions:
m = 4
aggCol, *cols = gen_example(2, m).columns
fun = 'mean'

def ours(df):
    funname = fun if isinstance(fun, str) else fun.__name__
    return agg(df, cols, aggCol, funname, total=False)

def u12(df):
    return v_u12(df, cols, aggCol, fun)

def equality_check(a, b):
    a = a.sort_values(cols).reset_index(drop=True)
    b = b.sort_values(cols).reset_index(drop=True)
    non_numeric = a[aggCol].dtype == 'object'
    if non_numeric:
        return a[cols+[aggCol]].equals(b[cols+[aggCol]])
    return a[cols].equals(b[cols]) and np.allclose(a[aggCol], b[aggCol])

perfplot.show(
    time_unit='auto',
    setup=lambda n: gen_example(n, m),
    kernels=[ours, u12],
    n_range=[2 ** k for k in range(4, 21)],
    equality_check=equality_check,
    xlabel=f'n rows\n(m={m} columns, fun={fun})'
)
The comparisons for a few aggregation functions and m values are below (y-axis is average time: lower is better).
[perfplot timing charts for (m=4, fun='mean'), (m=10, fun='mean'), (m=10, fun='sum') and (m=4, fun='set'); images not reproduced here.]
For functions that are not associative (e.g. 'mean'), our "progressive re-aggregation" needs to keep track of multiple values (e.g., for mean: sum and count), so for relatively small DataFrames the speed is about twice as slow as u12. But as the size grows, the gain of the re-aggregation overcomes that and ours becomes faster.
Maybe you could try this with recursion, like the below:
newdf = pd.DataFrame(columns=df.columns)
cols = list('QRST')
aggCol = 'P'

def aggregation(cols, origcols, aggCol, df, count=1):
    global newdf
    cols = origcols[:count]
    count += 1
    newdf = newdf.append(df.groupby(cols)[aggCol].agg('mean').round(2).reset_index().T.reindex(origcols + [aggCol]).T, ignore_index=True)
    if cols != origcols:
        aggregation(cols, origcols, aggCol, df, count)

aggregation(cols, cols, aggCol, df)
newdf['agg'] = newdf.pop(aggCol)
print(newdf)
Output:
Q R S T agg
0 NR NaN NaN NaN 4.33
1 R NaN NaN NaN 0.33
2 NR F NaN NaN 4.22
3 NR M NaN NaN 4.67
4 R F NaN NaN 0.5
5 R M NaN NaN 0
6 NR F HOL NaN 6.25
7 NR F NHOL NaN 2.6
8 NR M NHOL NaN 4.67
9 R F HOL NaN 1
10 R F NHOL NaN 0
11 R M NHOL NaN 0
12 NR F HOL F 6.25
13 NR F NHOL F 2.6
14 NR M NHOL M 4.67
15 R F HOL F 1
16 R F NHOL F 0
17 R M NHOL M 0
Timing with the following code (running it 5000 times):
import time

u11time1 = time.time()
for i in range(5000):
    df = pd.read_clipboard()
    newdf = pd.DataFrame(columns=df.columns)
    cols = list('QRST')
    aggCol = 'P'
    def aggregation(cols, origcols, aggCol, df, count=1):
        global newdf
        cols = origcols[:count]
        count += 1
        newdf = newdf.append(df.groupby(cols)[aggCol].agg('mean').round(2).reset_index().T.reindex(origcols + [aggCol]).T, ignore_index=True)
        if cols != origcols:
            aggregation(cols, origcols, aggCol, df, count)
    aggregation(cols, cols, aggCol, df)
    newdf['agg'] = newdf.pop(aggCol)
u11time2 = time.time()
print('u11 time:', u11time2 - u11time1)

thepyguytime1 = time.time()
for i in range(5000):
    df = pd.read_clipboard()
    cols = list('QRST')
    aggCol = 'P'
    groupCols = []
    result = []
    for col in cols:
        groupCols.append(col)
        result.append(df.groupby(groupCols)[aggCol].agg(count='count').reset_index())
    result = pd.concat(result)[groupCols+['count']]
thepyguytime2 = time.time()
print('ThePyGuy time:', thepyguytime2 - thepyguytime1)
Gives:
u11 time: 120.2678394317627
ThePyGuy time: 153.01533579826355
My code is faster by about 33 seconds. If you run it only a few times, e.g. 10 times, my code usually still wins, but not by such a big margin. For more iterations, e.g. 5000 times, my code performs much faster than your original for-loop code.
I think this is a bit more efficient:
b = df.groupby(cols)[aggCol].count()
l = list(range(b.index.nlevels-1))
p = [b]
while l:
    p.append(b.groupby(level=l).sum())
    l.pop()
result = pd.concat(p)
Timings:
7.4 ms ± 55.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vs original
20.7 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This uses sum instead of counting all the elements each time: count all the elements once at the finest level, then sum over a decreasing number of index levels.
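To illustrate the point with the question's data (a check of mine, not part of the original answer): the finest-level counts already contain everything needed, so each coarser level is just a sum of the level below:

fine = df.groupby(cols)[aggCol].count()
coarser = fine.groupby(level=cols[:-1]).sum()
# same result as grouping the raw data again on the coarser key
assert coarser.equals(df.groupby(cols[:-1])[aggCol].count())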
For mean (averaging), we can use @PierreD's suggestion: carry a sum and a count, then aggregate:
from itertools import zip_longest

cols = list('QRST')
aggCol = 'P'

b = df.groupby(cols)[aggCol].agg(['sum', 'count'])
l = list(range(b.index.nlevels-1))
p = [b]
while l:
    p.append(b.groupby(level=l).sum())
    l.pop()
result = pd.concat(p)
result = result.assign(avg=result['sum']/result['count']).drop(['sum', 'count'], axis=1)
result
result.index = pd.MultiIndex.from_arrays(list(zip_longest(*result.index)))
Output:
avg
(NR, F, HOL, F) 6.250000
(NR, F, NHOL, F) 2.600000
(NR, M, NHOL, M) 4.666667
(R, F, HOL, F) 1.000000
(R, F, NHOL, F) 0.000000
(R, M, NHOL, M) 0.000000
(NR, F, HOL) 6.250000
(NR, F, NHOL) 2.600000
(NR, M, NHOL) 4.666667
(R, F, HOL) 1.000000
(R, F, NHOL) 0.000000
(R, M, NHOL) 0.000000
(NR, F) 4.222222
(NR, M) 4.666667
(R, F) 0.500000
(R, M) 0.000000
NR 4.333333
R 0.333333
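If the flat, column-style output of the question is preferred over the tuple index, the pieces can also be reset before concatenation (a small variant of mine, reusing b from the code above):

l = list(range(b.index.nlevels - 1))
p = [b]
while l:
    p.append(b.groupby(level=l).sum())
    l.pop()
# reset each level's index before concatenating, so the grouping keys become columns
flat = pd.concat([x.reset_index() for x in p], ignore_index=True)
flat = flat.assign(avg=flat['sum'] / flat['count']).drop(columns=['sum', 'count'])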
Pandas' groupby feature is versatile, and can take custom functions. I am presenting the solution with a lambda that returns the count, but you could easily substitute np.min or np.max or other custom functions. Please bear in mind that any of these functions should make sense when applied recursively over the groupby's nesting levels (so count, min, max will all make sense; but if you have a statistical function such as mean, you will lose the information needed to calculate correct aggregates at higher groupings).
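As a quick sanity check of that caveat (my illustration, arbitrary numbers): the min of per-group minima equals the overall min, but the mean of per-group means generally does not equal the overall mean:

import numpy as np

a, b = np.array([1.0, 2.0]), np.array([10.0])
assert min(a.min(), b.min()) == np.concatenate([a, b]).min()            # min re-aggregates correctly
assert np.mean([a.mean(), b.mean()]) != np.concatenate([a, b]).mean()   # 5.75 vs 4.33...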
df=pd.DataFrame.from_records(
[['PLAC','NR','F','HOL','F'],
['PLAC','NR', 'F', 'NHOL', 'F'],
['TRTB','NR', 'M', 'NHOL', 'M'],
['PLAC','NR', 'M', 'NHOL', 'M'],
['PLAC','NR', 'F', 'NHOL', 'F'],
['PLAC','R', 'M', 'NHOL', 'M'],
['TRTA','R', 'F', 'HOL', 'F'],
['TRTA','NR', 'F', 'HOL', 'F'],
['TRTB','NR', 'F', 'NHOL', 'F'],
['PLAC','NR', 'F', 'NHOL', 'F'],
['TRTB','NR', 'F', 'NHOL', 'F'],
['TRTB','NR', 'M', 'NHOL', 'M'],
['TRTA','NR', 'F', 'HOL', 'F'],
['PLAC','NR', 'F', 'HOL', 'F'],
['PLAC','R', 'F', 'NHOL', 'F']],
columns = ['P','Q','R','S','T'])
First, define a groupby-dataframe using the most granular groupings:
grdf = df.groupby(['Q','R','S','T'])['P'].apply(lambda x:len(x)).to_frame()
Now use the unstack() method of this dataframe to successively obtain aggregates at less granular grouping levels. For instance, at one level higher, with index ['Q','R','S']:
df2 = grdf.unstack()
result2 = df2.sum(axis=1).rename(str(df2.index.names)).to_frame()
result2 will contain the counts aggregated at the ['Q', 'R', 'S'] level.
Similarly, compute aggregates at all grouping levels desired and append them all to the same dataframe using a function like this (ideally you can make this a recursive function, but I kept it simple so the flow can be easily seen):
def combine_aggregates(df):
    #if type(grdf) == pd.core.frame.DataFrame:
    df1 = df
    result1 = df.sum(axis=1).rename(str(df1.index.names)).to_frame()
    df2 = df1.unstack()
    result2 = df2.sum(axis=1).rename(str(df2.index.names)).to_frame()
    df3 = df2.unstack()
    result3 = df3.sum(axis=1).rename(str(df3.index.names)).to_frame()
    df4 = df3.unstack()
    result4 = df4.sum(axis=1).rename(str(df4.index.names)).to_frame()
    return result1.append(result2).append(result3).append(result4)

combine_aggregates(grdf)
And the final output will contain the aggregates from all four grouping levels appended into a single dataframe.
Most of the code is the same, except for the indexing and one extra engine argument. I preset the index, then group by levels one at a time. Also, for performance, I try to use numba for numeric types (see Enhancing performance in the pandas docs). Depending on the size of df, you can add the parallel and nogil options in numba. Numba's first execution can be slow as it compiles, but subsequent executions should be faster.
l = list('QRST')
df1 = df.set_index(l)
result = [
    df1.groupby(level=l[:i+1])['P'].agg(np.mean, engine='numba').round(2).reset_index()
    for i in range(4)
]
pd.concat(result)
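One caveat from me (not stated in the answer): engine='numba' requires the optional numba package to be installed. Without it, the same per-level loop still works with pandas' default engine by dropping the argument, for example:

# fallback without numba: same grouping by index-level prefixes, default engine
result = [
    df1.groupby(level=l[:i + 1])['P'].mean().round(2).reset_index()
    for i in range(len(l))
]
pd.concat(result)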