Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is There Complete Overlap Between `pd.pivot_table` and `pd.DataFrame.groupby` + `pd.DataFrame.unstack`?

(Please note that there's a question Pandas: group by and Pivot table difference, but this question is different.)

Suppose you start with a DataFrame

df = pd.DataFrame({'a': ['x'] * 2 + ['y'] * 2, 'b': [0, 1, 0, 1], 'val': range(4)})
>>> df
Out[18]: 
   a  b  val
0  x  0    0
1  x  1    1
2  y  0    2
3  y  1    3

Now suppose you want to make the index a, the columns b, the values in a cell val, and specify what to do if there are two or more values in a resulting cell:

b  0  1
a      
x  0  1
y  2  3

Then you can do this either through

df.val.groupby([df.a, df.b]).sum().unstack()

or through

pd.pivot_table(df, index='a', columns='b', values='val', aggfunc='sum')

So it seems to me that there's a simple correspondence between correspondence between the two (given one, you could almost write a script to transform it into the other). I also thought of more complex cases with hierarchical indices / columns, but I still see no difference.

Is there something I've missed?

  • Are there operations that can be performed using one and not the other?

  • Are there, perhaps, operations easier to perform using one over the other?

  • If not, why not deprecate pivot_tale? groupby seems much more general.

like image 327
Ami Tavory Avatar asked Sep 24 '16 09:09

Ami Tavory


People also ask

What's the difference between pivot_table () and Groupby ()?

What is the difference between the pivot_table and the groupby? The groupby method is generally enough for two-dimensional operations, but pivot_table is used for multi-dimensional grouping operations.

How do I unstack Groupby?

The first index will have the column name and the second index will have the name of the aggregated function. Now, use stack() at level 0 of the grouped dataframe and unstack() the grouped dataframe. Then, use stack() at level 1 of the grouped dataframe and unstack() the grouped dataframe.

What does unstack () do?

Pivot a level of the (necessarily hierarchical) index labels. Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels. If the index is not a MultiIndex, the output will be a Series (the analogue of stack when the columns are not a MultiIndex).

How is pivot_table () different from pivot () when both perform pivoting?

Basically, the pivot_table() function is a generalization of the pivot() function that allows aggregation of values — for example, through the len() function in the previous example. Pivot only works — or makes sense — if you need to pivot a table and show values without any aggregation. Here's an example.


1 Answers

If I understood the source code for pivot_table(index, columns, values, aggfunc) correctly it's tuned up equivalent for:

df.groupby([index + columns]).agg(aggfunc).unstack(columns)

plus:

  • margins (subtotals and grand totals as @ayhan has already said)
  • pivot_table() also removes extra multi-levels from columns axis (see example below)
  • convenient dropna parameter: Do not include columns whose entries are all NaN

Demo: (I took this DF from the docstring [source code for pivot_table()])

In [40]: df
Out[40]:
     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7

In [41]: df.pivot_table(index=['A','B'], columns='C', values='D', aggfunc=[np.sum,np.mean])
Out[41]:
          sum        mean
C       large small large small
A   B
bar one   4.0   5.0   4.0   5.0
    two   7.0   6.0   7.0   6.0
foo one   4.0   1.0   2.0   1.0
    two   NaN   6.0   NaN   3.0

pay attention at the top level column: D

In [42]: df.groupby(['A','B','C']).agg([np.sum, np.mean]).unstack('C')
Out[42]:
            D
          sum        mean
C       large small large small
A   B
bar one   4.0   5.0   4.0   5.0
    two   7.0   6.0   7.0   6.0
foo one   4.0   1.0   2.0   1.0
    two   NaN   6.0   NaN   3.0

why not deprecate pivot_tale? groupby seems much more general.

IMO, because it's very easy to use and very convenient! ;)

like image 54
MaxU - stop WAR against UA Avatar answered Sep 28 '22 13:09

MaxU - stop WAR against UA