(Please note that there's a question Pandas: group by and Pivot table difference, but this question is different.)
Suppose you start with a DataFrame
df = pd.DataFrame({'a': ['x'] * 2 + ['y'] * 2, 'b': [0, 1, 0, 1], 'val': range(4)})
>>> df
Out[18]:
a b val
0 x 0 0
1 x 1 1
2 y 0 2
3 y 1 3
Now suppose you want to make the index a
, the columns b
, the values in a cell val
, and specify what to do if there are two or more values in a resulting cell:
b 0 1
a
x 0 1
y 2 3
Then you can do this either through
df.val.groupby([df.a, df.b]).sum().unstack()
or through
pd.pivot_table(df, index='a', columns='b', values='val', aggfunc='sum')
So it seems to me that there's a simple correspondence between correspondence between the two (given one, you could almost write a script to transform it into the other). I also thought of more complex cases with hierarchical indices / columns, but I still see no difference.
Is there something I've missed?
Are there operations that can be performed using one and not the other?
Are there, perhaps, operations easier to perform using one over the other?
If not, why not deprecate pivot_tale
? groupby
seems much more general.
What is the difference between the pivot_table and the groupby? The groupby method is generally enough for two-dimensional operations, but pivot_table is used for multi-dimensional grouping operations.
The first index will have the column name and the second index will have the name of the aggregated function. Now, use stack() at level 0 of the grouped dataframe and unstack() the grouped dataframe. Then, use stack() at level 1 of the grouped dataframe and unstack() the grouped dataframe.
Pivot a level of the (necessarily hierarchical) index labels. Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels. If the index is not a MultiIndex, the output will be a Series (the analogue of stack when the columns are not a MultiIndex).
Basically, the pivot_table() function is a generalization of the pivot() function that allows aggregation of values — for example, through the len() function in the previous example. Pivot only works — or makes sense — if you need to pivot a table and show values without any aggregation. Here's an example.
If I understood the source code for pivot_table(index, columns, values, aggfunc)
correctly it's tuned up equivalent for:
df.groupby([index + columns]).agg(aggfunc).unstack(columns)
plus:
pivot_table()
also removes extra multi-levels from columns axis (see example below)dropna
parameter: Do not include columns whose entries are all NaNDemo: (I took this DF from the docstring [source code for pivot_table()
])
In [40]: df
Out[40]:
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7
In [41]: df.pivot_table(index=['A','B'], columns='C', values='D', aggfunc=[np.sum,np.mean])
Out[41]:
sum mean
C large small large small
A B
bar one 4.0 5.0 4.0 5.0
two 7.0 6.0 7.0 6.0
foo one 4.0 1.0 2.0 1.0
two NaN 6.0 NaN 3.0
pay attention at the top level column: D
In [42]: df.groupby(['A','B','C']).agg([np.sum, np.mean]).unstack('C')
Out[42]:
D
sum mean
C large small large small
A B
bar one 4.0 5.0 4.0 5.0
two 7.0 6.0 7.0 6.0
foo one 4.0 1.0 2.0 1.0
two NaN 6.0 NaN 3.0
why not deprecate pivot_tale? groupby seems much more general.
IMO, because it's very easy to use and very convenient! ;)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With