Using pandas, is it possible to compute a single cross-tabulation (or pivot table) containing values calculated from two different functions?
import pandas as pd
import numpy as np
c1 = np.repeat(['a','b'], [50, 50], axis=0)
c2 = list('xy'*50)
c3 = np.repeat(['G1','G2'], [50, 50], axis=0)
np.random.shuffle(c3)
c4=np.repeat([1,2], [50,50],axis=0)
np.random.shuffle(c4)
val = np.random.rand(100)
df = pd.DataFrame({'c1':c1, 'c2':c2, 'c3':c3, 'c4':c4, 'val':val})
frequencyTable = pd.crosstab([df.c1,df.c2],[df.c3,df.c4])
meanVal = pd.crosstab([df.c1,df.c2],[df.c3,df.c4],values=df.val,aggfunc=np.mean)
So, both the rows and the columns are the same in both tables, but what I'd really like is a table with both frequencies and mean values:
c3       G1                              G2
c4        1               2               1               2
c1 c2  freq       val  freq       val  freq       val  freq       val
a  x      6  0.624931     5  0.582268     8  0.528231     6  0.362804
   y      7  0.493890     8  0.465741     3  0.613126     7  0.312894
b  x      9  0.488255     5  0.804015     6  0.722640     5  0.369480
   y      6  0.462653     4  0.506791     5  0.583695    10  0.517954
What is the difference between pivot_table and groupby? Both aggregate values by group keys. groupby returns the result in long form, indexed by the keys, whereas pivot_table also spreads one or more of the keys across the columns, which gives a two-dimensional table.
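To make the comparison concrete, here is a minimal sketch on a small made-up frame (the data and the variable names are illustrative assumptions, not taken from the question). It suggests that unstacking a groupby result gives the same wide table as pivot_table:

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    'c1': ['a', 'a', 'b', 'b', 'a', 'b'],
    'c3': ['G1', 'G2', 'G1', 'G2', 'G1', 'G2'],
    'val': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# groupby: long form, one row per (c1, c3) pair.
long_form = df.groupby(['c1', 'c3'])['val'].mean()

# Moving c3 from the index to the columns gives the wide form.
wide_from_groupby = long_form.unstack('c3')

# pivot_table: the same aggregation, returned in wide form directly.
wide_from_pivot = df.pivot_table(index='c1', columns='c3',
                                 values='val', aggfunc='mean')
```

Which form is more convenient depends on what you do next: the long form is easier to filter and merge, the wide form is easier to read as a table.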
pivot() will error with a ValueError: Index contains duplicate entries, cannot reshape if the index/column pair is not unique. In this case, consider using pivot_table() which is a generalization of pivot that can handle duplicate values for one index/column pair.
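A short sketch of that failure and the workaround, using a made-up frame in which the pair ('a', 'x') occurs twice (column names and values are assumptions for illustration only):

```python
import pandas as pd

# The ('a', 'x') pair appears twice, so pivot() cannot place both values
# in a single cell.
df = pd.DataFrame({
    'row': ['a', 'a', 'b'],
    'col': ['x', 'x', 'y'],
    'val': [1.0, 3.0, 5.0],
})

try:
    df.pivot(index='row', columns='col', values='val')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    message = str(err)

# pivot_table() aggregates the duplicates instead (here with the mean).
table = df.pivot_table(index='row', columns='col', values='val',
                       aggfunc='mean')
```

Combinations that never occur, such as ('a', 'y') here, are left as NaN in the pivot_table result.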
The pivot_table() function creates a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table are stored in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame.
crosstab() computes a simple cross-tabulation of two (or more) factors. By default it computes a frequency table of the factors, unless an array of values and an aggregation function are passed. Its index argument gives the values to group by in the rows, and its columns argument gives the values to group by in the columns.
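The two modes of crosstab() can be sketched on a small made-up frame (the data below is an assumption for illustration): without values it counts occurrences, and with values plus aggfunc it aggregates those values per cell.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    'c1': ['a', 'a', 'b', 'b', 'a', 'b'],
    'c3': ['G1', 'G2', 'G1', 'G2', 'G1', 'G2'],
    'val': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Default: a frequency table of (c1, c3) combinations.
counts = pd.crosstab(df.c1, df.c3)

# With values and aggfunc: aggregate val within each (c1, c3) cell.
sums = pd.crosstab(df.c1, df.c3, values=df.val, aggfunc='sum')
```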
You can give a list of functions:
pd.crosstab([df.c1,df.c2], [df.c3,df.c4], values=df.val, aggfunc=[len, np.mean])
If you want the table as shown in your question, you will have to rearrange the levels a bit:
In [42]: table = pd.crosstab([df.c1,df.c2], [df.c3,df.c4], values=df.val, aggfunc=[len, np.mean])
In [43]: table
Out[43]:
      len                  mean
c3     G1      G2            G1                  G2
c4      1   2   1   2         1         2         1         2
c1 c2
a  x    4   6   8   7  0.303036  0.414474  0.624900  0.425234
   y    5   5   8   7  0.543363  0.480419  0.583499  0.637657
b  x   10   6   4   5  0.400279  0.436929  0.442924  0.287572
   y    6   8   5   6  0.400427  0.623319  0.764506  0.408708
In [44]: table.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)
Out[44]:
c3       G1                              G2
c4        1               2               1               2
        len      mean   len      mean   len      mean   len      mean
c1 c2
a  x      4  0.303036     6  0.414474     8  0.624900     7  0.425234
   y      5  0.543363     5  0.480419     8  0.583499     7  0.637657
b  x     10  0.400279     6  0.436929     4  0.442924     5  0.287572
   y      6  0.400427     8  0.623319     5  0.764506     6  0.408708
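If you also want the innermost labels to read freq and val as in the question, one possible variation is to rename the function level before reordering. This is only a sketch under a few assumptions: it rebuilds similar data with a seeded generator (np.random.default_rng(0)) so it can be run on its own, and it passes the string names 'count' and 'mean' in place of len and np.mean. Note that 'count' counts non-null values, so it only matches len when val has no missing entries.

```python
import numpy as np
import pandas as pd

# Rebuild data shaped like the question's; the seed is an assumption
# made only so that the sketch is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'c1': np.repeat(['a', 'b'], 50),
    'c2': list('xy' * 50),
    'c3': rng.permutation(np.repeat(['G1', 'G2'], 50)),
    'c4': rng.permutation(np.repeat([1, 2], 50)),
    'val': rng.random(100),
})

# Two aggregations at once; the function names become the top column level.
table = pd.crosstab([df.c1, df.c2], [df.c3, df.c4],
                    values=df.val, aggfunc=['count', 'mean'])

# Rename the function level, move it innermost, then sort the columns so
# that freq and val sit side by side under each (c3, c4) pair.
table = (table.rename(columns={'count': 'freq', 'mean': 'val'}, level=0)
              .reorder_levels([1, 2, 0], axis=1)
              .sort_index(axis=1))
```

The frequencies for a single aggregation can then be pulled back out with table.xs('freq', axis=1, level=2).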