TL;DR: How can we achieve something similar to SQL's GROUP BY ROLLUP with any kind of aggregate in pandas? (Credit to @Scott Boston for this term.)
I have following dataframe:
P Q R S T
0 PLAC NR F HOL F
1 PLAC NR F NHOL F
2 TRTB NR M NHOL M
3 PLAC NR M NHOL M
4 PLAC NR F NHOL F
5 PLAC R M NHOL M
6 TRTA R F HOL F
7 TRTA NR F HOL F
8 TRTB NR F NHOL F
9 PLAC NR F NHOL F
10 TRTB NR F NHOL F
11 TRTB NR M NHOL M
12 TRTA NR F HOL F
13 PLAC NR F HOL F
14 PLAC R F NHOL F
For the list of columns ['Q', 'R', 'S', 'T'], I want to calculate some aggregates on the P column for the following 4 lists of grouping columns:
['Q']
['Q', 'R']
['Q', 'R', 'S']
['Q', 'R', 'S', 'T']
I've already written the code to group the above dataframe on an increasing number of columns, calculate the aggregate (using count for the sake of simplicity) on each groupby object, and finally concatenate the results:
cols = list('QRST')
aggCol = 'P'
groupCols = []
result = []
for col in cols:
    groupCols.append(col)
    result.append(df.groupby(groupCols)[aggCol].agg(count='count').reset_index())
result = pd.concat(result)[groupCols+['count']]
However, I have a strong feeling that the above method is not very efficient in terms of CPU time. Is there a more efficient way to apply an aggregate over such a continuously increasing number of grouping columns?
Why I think it is not so efficient: for the above values, in the first iteration it groups the dataframe on the Q column and calculates the aggregate. In the next iteration it groups the dataframe on Q and R, which means it again needs to group by Q and then R, even though the data was already grouped by Q in the first iteration, so the same operation is repeated. If there were some way to utilize the previously created groups, I think it would be more efficient.
OUTPUT:
Q R S T count
0 NR NaN NaN NaN 12
1 R NaN NaN NaN 3
0 NR F NaN NaN 9
1 NR M NaN NaN 3
2 R F NaN NaN 2
3 R M NaN NaN 1
0 NR F HOL NaN 4
1 NR F NHOL NaN 5
2 NR M NHOL NaN 3
3 R F HOL NaN 1
4 R F NHOL NaN 1
5 R M NHOL NaN 1
0 NR F HOL F 4
1 NR F NHOL F 5
2 NR M NHOL M 3
3 R F HOL F 1
4 R F NHOL F 1
5 R M NHOL M 1
I already looked into Is there an equivalent of SQL GROUP BY ROLLUP in Python pandas? and Pandas Pivot tables row subtotals, but they don't work in my case. I already tried them: those methods can only be used to get the count, and they immediately fail even for unique counts when the same identifier appears for more than one value:
pd.pivot_table(df, aggCol, columns=cols, aggfunc='count', margins=True).T.reset_index()
Q R S T P
0 NR F HOL F 4
1 NR F NHOL F 5
2 NR M NHOL M 3
3 NR All 3
4 R F HOL F 1
5 R F NHOL F 1
6 R M NHOL M 1
7 R All 3
In order to avoid any unnecessary confusion with just getting the count, and as per the suggestion in the comments, I have also added an example using the mean as the aggregate, changing the P column to a numeric type:
P Q R S T
0 9 NR F HOL F
1 7 NR F NHOL F
2 3 NR M NHOL M
3 9 NR M NHOL M
4 1 NR F NHOL F
5 0 R M NHOL M
6 1 R F HOL F
7 7 NR F HOL F
8 2 NR F NHOL F
9 2 NR F NHOL F
10 1 NR F NHOL F
11 2 NR M NHOL M
12 3 NR F HOL F
13 6 NR F HOL F
14 0 R F NHOL F
cols = list('QRST')
aggCol = 'P'
groupCols = []
result = []
for col in cols:
    groupCols.append(col)
    result.append(df.groupby(groupCols)[aggCol]
                    .agg(agg=np.mean)
                    .round(2).reset_index())
result = pd.concat(result)[groupCols+['agg']]
>>> result
Q R S T agg
0 NR NaN NaN NaN 4.33
1 R NaN NaN NaN 0.33
0 NR F NaN NaN 4.22
1 NR M NaN NaN 4.67
2 R F NaN NaN 0.50
3 R M NaN NaN 0.00
0 NR F HOL NaN 6.25
1 NR F NHOL NaN 2.60
2 NR M NHOL NaN 4.67
3 R F HOL NaN 1.00
4 R F NHOL NaN 0.00
5 R M NHOL NaN 0.00
0 NR F HOL F 6.25
1 NR F NHOL F 2.60
2 NR M NHOL M 4.67
3 R F HOL F 1.00
4 R F NHOL F 0.00
5 R M NHOL M 0.00
Building on the idea of @ScottBoston (progressive aggregation, i.e., repeatedly aggregating on the previous aggregate result), we can do something that is relatively generic with regard to the aggregation function, if that function can be expressed as a composition of functions ((f3 ∘ f2 ∘ f2 ∘ ... ∘ f1)(x), or in other words: f3(f2(f2(...(f1(x)))))).
For example, sum works fine as is, because sum is associative, so the sum of group sums is the sum of the whole. For count, the initial function (f1) is indeed count, but f2 has to be sum, and the final f3 has to be identity. For mean, the initial function (f1) has to produce two quantities: sum and count. The intermediary function f2 can be sum, and the final function (f3) has then to be the ratio of the two quantities.
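To make that composition concrete, here is a minimal sketch of mine (toy data, not from the question) showing that a mean can be re-aggregated correctly from per-group sums and counts, but only if the ratio is taken at the very end:

import numpy as np
import pandas as pd

# toy data, just to illustrate the three steps for 'mean'
toy = pd.DataFrame({'Q': ['NR', 'NR', 'R'], 'R': ['F', 'M', 'F'], 'P': [1.0, 3.0, 5.0]})

# f1: at the finest level, keep both quantities needed later
fine = toy.groupby(['Q', 'R'])['P'].agg(['sum', 'count'])

# f2: re-aggregate to a coarser level by summing both quantities
coarse = fine.groupby(level='Q').sum()

# f3: only at the very end, take the ratio to obtain the mean
assert np.allclose(coarse['sum'] / coarse['count'], toy.groupby('Q')['P'].mean())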
Here is a rough template, with a few functions defined. As an added bonus, the function also optionally produces a grand total:
# think map-reduce: first map, then reduce (arbitrary number of times), then map to result
myfuncs = {
    'sum': [sum, sum],
    'prod': ['prod', 'prod'],
    'count': ['count', sum],
    'set': [set, lambda g: set.union(*g)],
    'list': [list, sum],
    'mean': [[sum, 'count'], sum, lambda r: r[0]/r[1]],
    'var': [
        [lambda x: (x**2).sum(), sum, 'count'],
        sum,
        lambda r: (r[0].sum() - r[1].sum()**2 / r[2]) / (r[2] - 1)],
    'std': [
        [lambda x: (x**2).sum(), sum, 'count'],
        sum,
        lambda r: np.sqrt((r[0].sum() - r[1].sum()**2 / r[2]) / (r[2] - 1))],
}

totalCol = '__total__'

def agg(df, cols, aggCol, fun, total=True):
    if total:
        cols = [totalCol] + cols
        df = df.assign(__total__=0)
    funs = myfuncs[fun]
    b = df.groupby(cols).agg({aggCol: funs[0]})
    frames = [b.reset_index()]
    for k in range(1, len(cols)):
        b = b.groupby(cols[:-k]).agg(funs[1])
        frames.append(b.reset_index())
    result = pd.concat(frames).reset_index(drop=True)
    result = result[frames[0].columns]
    if len(funs) > 2:
        s = result[aggCol].apply(funs[2], axis=1)
        result = result.drop(aggCol, axis=1, level=0)
        result[aggCol] = s
        result.columns = result.columns.droplevel(-1)
    if total:
        result = result.drop(columns=[totalCol])
    return result
cols = list('QRST')
aggCol = 'P'
>>> agg(df, cols, aggCol, 'count')
Q R S T P
0 NR F HOL F 4
1 NR F NHOL F 5
2 NR M NHOL M 3
3 R F HOL F 1
.. ... ... ... ... ..
15 R M NaN NaN 1
16 NR NaN NaN NaN 12
17 R NaN NaN NaN 3
18 NaN NaN NaN NaN 15
>>> agg(df, cols, aggCol, 'mean')
Q R S T P
0 NR F HOL F 6.250000
1 NR F NHOL F 2.600000
2 NR M NHOL M 4.666667
3 R F HOL F 1.000000
.. ... ... ... ... ...
15 R M NaN NaN 0.000000
16 NR NaN NaN NaN 4.333333
17 R NaN NaN NaN 0.333333
18 NaN NaN NaN NaN 3.533333
>>> agg(df, cols, aggCol, 'sum')
Q R S T P
0 NR F HOL F 25
1 NR F NHOL F 13
2 NR M NHOL M 14
3 R F HOL F 1
.. ... ... ... ... ..
15 R M NaN NaN 0
16 NR NaN NaN NaN 52
17 R NaN NaN NaN 1
18 NaN NaN NaN NaN 53
>>> agg(df, cols, aggCol, 'set')
Q R S T P
0 NR F HOL F {9, 3, 6, 7}
1 NR F NHOL F {1, 2, 7}
2 NR M NHOL M {9, 2, 3}
3 R F HOL F {1}
.. ... ... ... ... ...
15 R M NaN NaN {0}
16 NR NaN NaN NaN {1, 2, 3, 6, 7, 9}
17 R NaN NaN NaN {0, 1}
18 NaN NaN NaN NaN {0, 1, 2, 3, 6, 7, 9}
>>> agg(df, cols, aggCol, 'std')
Q R S T P
0 NR F HOL F 2.500000
1 NR F NHOL F 2.509980
2 NR M NHOL M 3.785939
3 R F HOL F NaN
.. ... ... ... ... ...
15 R M NaN NaN NaN
16 NR NaN NaN NaN 3.055050
17 R NaN NaN NaN 0.577350
18 NaN NaN NaN NaN 3.181793
The code is not as 'pure' as I would like it to be. There are two reasons for that. First, groupby likes to do some magic to the shape of the result: in some cases (but not always, strangely enough), if there is only one resulting group, the output is squeezed to a Series. Second, pandas arithmetic on set objects seems sometimes bogus, or finicky at best; my initial definition had 'set': [set, sum], and this was working reasonably well (pandas sometimes seems to understand that for .agg(sum) on a Series of set objects it is desirable to apply set.union), except that, weirdly enough, in some conditions we'd get a NaN result instead.
Other caveats: this only works for a single aggCol, and the expressions for std and var are relatively naive. For improved numerical stability, see Standard Deviation: Rapid calculation methods.
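For reference, a minimal sketch of mine (not part of the answer above) of the numerically stabler "combine" step described in that article (Chan et al.'s parallel formula): each group carries (count, mean, M2), where M2 is the sum of squared deviations from the group mean, and two groups are merged like this:

import numpy as np

def combine_moments(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    # merge the running moments of two disjoint partitions
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta**2 * n_a * n_b / n
    return n, mean, m2

# quick check on arbitrary numbers
x = np.array([9.0, 7.0, 3.0, 9.0, 1.0])
a, b = x[:2], x[2:]
n, mean, m2 = combine_moments(len(a), a.mean(), ((a - a.mean())**2).sum(),
                              len(b), b.mean(), ((b - b.mean())**2).sum())
assert np.isclose(m2 / (n - 1), x.var(ddof=1))  # sample variance matches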
Since the original posting of this answer, another solution has been proposed by @U12-Forward. After a bit of cleaning (e.g. not using recursion, and changing the agg dtype to whatever it needs to be, instead of object), this solution becomes:
def v_u12(df, cols, aggCol, fun):
    newdf = pd.DataFrame(columns=cols)
    for count in range(1, len(cols)+1):
        groupcols = cols[:count]
        newdf = newdf.append(
            df.groupby(groupcols)[aggCol].agg(fun).reset_index().reindex(columns=groupcols + [aggCol]),
            ignore_index=True,
        )
    return newdf
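A side note of mine (not from either answer): DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea can be written by collecting the frames and concatenating once, for example:

def v_u12_concat(df, cols, aggCol, fun):
    # one frame per prefix of the grouping columns, concatenated at the end
    frames = [
        df.groupby(cols[:count])[aggCol].agg(fun).reset_index()
        for count in range(1, len(cols) + 1)
    ]
    return pd.concat(frames, ignore_index=True).reindex(columns=cols + [aggCol])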
To compare speed, let's generate DataFrames of arbitrary sizes:
def gen_example(n, m=4, seed=-1):
    if seed >= 0:
        np.random.seed(seed)
    aggCol = 'v'
    cols = list(ascii_uppercase)[:m]
    choices = [['R', 'NR'], ['F', 'M'], ['HOL', 'NHOL']]
    df = pd.DataFrame({
        aggCol: np.random.uniform(size=n),
        **{
            k: np.random.choice(choices[np.random.randint(0, len(choices))], n)
            for k in cols
        }})
    return df
# example
>>> gen_example(8, 5, 0)
v A B C D E
0 0.548814 R M F M NR
1 0.715189 R M M M R
2 0.602763 NR M M F R
3 0.544883 R F F M NR
4 0.423655 NR M F F NR
5 0.645894 NR F M M R
6 0.437587 R M F M NR
7 0.891773 R F M M R
We can now compare speed over a range of sizes, using the excellent perfplot package, plus a few definitions:
m = 4
aggCol, *cols = gen_example(2, m).columns
fun = 'mean'

def ours(df):
    funname = fun if isinstance(fun, str) else fun.__name__
    return agg(df, cols, aggCol, funname, total=False)

def u12(df):
    return v_u12(df, cols, aggCol, fun)

def equality_check(a, b):
    a = a.sort_values(cols).reset_index(drop=True)
    b = b.sort_values(cols).reset_index(drop=True)
    non_numeric = a[aggCol].dtype == 'object'
    if non_numeric:
        return a[cols+[aggCol]].equals(b[cols+[aggCol]])
    return a[cols].equals(b[cols]) and np.allclose(a[aggCol], b[aggCol])

perfplot.show(
    time_unit='auto',
    setup=lambda n: gen_example(n, m),
    kernels=[ours, u12],
    n_range=[2 ** k for k in range(4, 21)],
    equality_check=equality_check,
    xlabel=f'n rows\n(m={m} columns, fun={fun})'
)
The comparisons for a few aggregation functions and m values are below (y-axis is average time: lower is better).
[perfplot timing charts for (m=4, fun='mean'), (m=10, fun='mean'), (m=10, fun='sum') and (m=4, fun='set'); images not reproduced here.]
For functions that are not associative (e.g. 'mean'), our "progressive re-aggregation" needs to keep track of multiple values (e.g., for mean: sum and count), so for relatively small DataFrames the speed is about twice as slow as u12. But as the size grows, the gain of the re-aggregation overcomes that and ours becomes faster.
Maybe you could try this with recursion, like the below:
newdf = pd.DataFrame(columns=df.columns)
cols = list('QRST')
aggCol = 'P'

def aggregation(cols, origcols, aggCol, df, count=1):
    global newdf
    cols = origcols[:count]
    count += 1
    newdf = newdf.append(df.groupby(cols)[aggCol].agg('mean').round(2).reset_index().T.reindex(origcols + [aggCol]).T, ignore_index=True)
    if cols != origcols:
        aggregation(cols, origcols, aggCol, df, count)

aggregation(cols, cols, aggCol, df)
newdf['agg'] = newdf.pop(aggCol)
print(newdf)
Output:
Q R S T agg
0 NR NaN NaN NaN 4.33
1 R NaN NaN NaN 0.33
2 NR F NaN NaN 4.22
3 NR M NaN NaN 4.67
4 R F NaN NaN 0.5
5 R M NaN NaN 0
6 NR F HOL NaN 6.25
7 NR F NHOL NaN 2.6
8 NR M NHOL NaN 4.67
9 R F HOL NaN 1
10 R F NHOL NaN 0
11 R M NHOL NaN 0
12 NR F HOL F 6.25
13 NR F NHOL F 2.6
14 NR M NHOL M 4.67
15 R F HOL F 1
16 R F NHOL F 0
17 R M NHOL M 0
Timing with the following code (running it 5000 times):
import time

u11time1 = time.time()
for i in range(5000):
    df = pd.read_clipboard()
    newdf = pd.DataFrame(columns=df.columns)
    cols = list('QRST')
    aggCol = 'P'
    def aggregation(cols, origcols, aggCol, df, count=1):
        global newdf
        cols = origcols[:count]
        count += 1
        newdf = newdf.append(df.groupby(cols)[aggCol].agg('mean').round(2).reset_index().T.reindex(origcols + [aggCol]).T, ignore_index=True)
        if cols != origcols:
            aggregation(cols, origcols, aggCol, df, count)
    aggregation(cols, cols, aggCol, df)
    newdf['agg'] = newdf.pop(aggCol)
u11time2 = time.time()
print('u11 time:', u11time2 - u11time1)

thepyguytime1 = time.time()
for i in range(5000):
    df = pd.read_clipboard()
    cols = list('QRST')
    aggCol = 'P'
    groupCols = []
    result = []
    for col in cols:
        groupCols.append(col)
        result.append(df.groupby(groupCols)[aggCol].agg(count='count').reset_index())
    result = pd.concat(result)[groupCols+['count']]
thepyguytime2 = time.time()
print('ThePyGuy time:', thepyguytime2 - thepyguytime1)
Gives:
u11 time: 120.2678394317627
ThePyGuy time: 153.01533579826355
My code is faster by about 33 seconds. If you run it only a few times, e.g. 10 times, my code usually still wins, but not by such a big margin. For more iterations, e.g. 5000 times, my code performs much faster than your original for-loop code.
I think this is a bit more efficient:
b = df.groupby(cols)[aggCol].count()
l = list(range(b.index.nlevels-1))
p = [b]
while l:
    p.append(b.groupby(level=l).sum())
    l.pop()
result = pd.concat(p)
Timings:
7.4 ms ± 55.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vs original
20.7 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This uses sum instead of counting all the elements each time: count all the elements once at the finest level, then sum over a decreasing number of index levels.
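To illustrate the point with the question's data (a check of mine, not part of the original answer): the finest-level counts already contain everything needed, so each coarser level is just a sum of the level below:

fine = df.groupby(cols)[aggCol].count()
coarser = fine.groupby(level=cols[:-1]).sum()
# same result as grouping the raw data again on the coarser key
assert coarser.equals(df.groupby(cols[:-1])[aggCol].count())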
For mean (averaging), we can use @PierreD's suggestion: carry a sum and a count, then aggregate:
from itertools import zip_longest

cols = list('QRST')
aggCol = 'P'

b = df.groupby(cols)[aggCol].agg(['sum', 'count'])
l = list(range(b.index.nlevels-1))
p = [b]
while l:
    p.append(b.groupby(level=l).sum())
    l.pop()
result = pd.concat(p)
result = result.assign(avg=result['sum']/result['count']).drop(['sum', 'count'], axis=1)
result
result.index = pd.MultiIndex.from_arrays(list(zip_longest(*result.index)))
Output:
avg
(NR, F, HOL, F) 6.250000
(NR, F, NHOL, F) 2.600000
(NR, M, NHOL, M) 4.666667
(R, F, HOL, F) 1.000000
(R, F, NHOL, F) 0.000000
(R, M, NHOL, M) 0.000000
(NR, F, HOL) 6.250000
(NR, F, NHOL) 2.600000
(NR, M, NHOL) 4.666667
(R, F, HOL) 1.000000
(R, F, NHOL) 0.000000
(R, M, NHOL) 0.000000
(NR, F) 4.222222
(NR, M) 4.666667
(R, F) 0.500000
(R, M) 0.000000
NR 4.333333
R 0.333333
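If the flat, column-style output of the question is preferred over the tuple index, the pieces can also be reset before concatenation (a small variant of mine, reusing b from the code above):

l = list(range(b.index.nlevels - 1))
p = [b]
while l:
    p.append(b.groupby(level=l).sum())
    l.pop()
# reset each level's index before concatenating, so the grouping keys become columns
flat = pd.concat([x.reset_index() for x in p], ignore_index=True)
flat = flat.assign(avg=flat['sum'] / flat['count']).drop(columns=['sum', 'count'])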
Pandas' groupby feature is versatile, and can take custom functions. I am presenting the solution with a lambda that returns the count, but you could easily substitute np.min or np.max or other custom functions. Please bear in mind that any of these functions should make sense when applied recursively over the groupby's nesting levels (so count, min, max will all make sense; but if you have a statistical function such as mean, you will lose the information needed to calculate correct aggregates at higher groupings).
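As a quick sanity check of that caveat (my illustration, arbitrary numbers): the min of per-group minima equals the overall min, but the mean of per-group means generally does not equal the overall mean:

import numpy as np

a, b = np.array([1.0, 2.0]), np.array([10.0])
assert min(a.min(), b.min()) == np.concatenate([a, b]).min()            # min re-aggregates correctly
assert np.mean([a.mean(), b.mean()]) != np.concatenate([a, b]).mean()   # 5.75 vs 4.33...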
df=pd.DataFrame.from_records(
[['PLAC','NR','F','HOL','F'],
['PLAC','NR', 'F', 'NHOL', 'F'],
['TRTB','NR', 'M', 'NHOL', 'M'],
['PLAC','NR', 'M', 'NHOL', 'M'],
['PLAC','NR', 'F', 'NHOL', 'F'],
['PLAC','R', 'M', 'NHOL', 'M'],
['TRTA','R', 'F', 'HOL', 'F'],
['TRTA','NR', 'F', 'HOL', 'F'],
['TRTB','NR', 'F', 'NHOL', 'F'],
['PLAC','NR', 'F', 'NHOL', 'F'],
['TRTB','NR', 'F', 'NHOL', 'F'],
['TRTB','NR', 'M', 'NHOL', 'M'],
['TRTA','NR', 'F', 'HOL', 'F'],
['PLAC','NR', 'F', 'HOL', 'F'],
['PLAC','R', 'F', 'NHOL', 'F']],
columns = ['P','Q','R','S','T'])
First, define a groupby-dataframe using the most granular groupings:
grdf = df.groupby(['Q','R','S','T'])['P'].apply(lambda x:len(x)).to_frame()
Now use the unstack() method of this dataframe to successively obtain aggregates at less granular grouping levels. For instance, at one level higher, with index ['Q','R','S']:
df2 = grdf.unstack()
result2 = df2.sum(axis=1).rename(str(df2.index.names)).to_frame()
result2 will contain the counts aggregated at the ['Q', 'R', 'S'] level.
Similarly, compute aggregates at all grouping levels desired and append them all to the same dataframe using a function like this (ideally you can make this a recursive function, but I kept it simple so the flow can be easily seen):
def combine_aggregates(df):
    #if type(grdf) == pd.core.frame.DataFrame:
    df1 = df
    result1 = df.sum(axis=1).rename(str(df1.index.names)).to_frame()
    df2 = df1.unstack()
    result2 = df2.sum(axis=1).rename(str(df2.index.names)).to_frame()
    df3 = df2.unstack()
    result3 = df3.sum(axis=1).rename(str(df3.index.names)).to_frame()
    df4 = df3.unstack()
    result4 = df4.sum(axis=1).rename(str(df4.index.names)).to_frame()
    return result1.append(result2).append(result3).append(result4)

combine_aggregates(grdf)
And the final output will contain the aggregates from all four grouping levels appended into a single dataframe.
Most of the code is the same, except for the indexing and one extra engine argument. I preset the index, then group by levels one at a time. Also, for performance, I try to use numba for numeric types (see Enhancing performance in the pandas docs). Depending on the size of df, you can add the parallel and nogil options in numba. Numba's first execution can be slow as it compiles, but subsequent executions should be faster.
l = list('QRST')
df1 = df.set_index(l)
result = [
    df1.groupby(level=l[:i+1])['P'].agg(np.mean, engine='numba').round(2).reset_index()
    for i in range(4)
]
pd.concat(result)
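One caveat from me (not stated in the answer): engine='numba' requires the optional numba package to be installed. Without it, the same per-level loop still works with pandas' default engine by dropping the argument, for example:

# fallback without numba: same grouping by index-level prefixes, default engine
result = [
    df1.groupby(level=l[:i + 1])['P'].mean().round(2).reset_index()
    for i in range(len(l))
]
pd.concat(result)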