Given a dataframe with different categorical variables, how do I return a cross-tabulation with percentages instead of frequencies?
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 6,
                       'B' : ['A', 'B', 'C'] * 8,
                       'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                       'D' : np.random.randn(24),
                       'E' : np.random.randn(24)})

    pd.crosstab(df.A,df.B)

    B      A  B  C
    A
    one    4  4  4
    three  2  2  2
    two    2  2  2
Using the margins option in crosstab to compute row and column totals gets us close enough to think that it should be possible using an aggfunc or groupby, but my meager brain can't think it through.
Desired output:

    B        A    B    C
    A
    one    .33  .33  .33
    three  .33  .33  .33
    two    .33  .33  .33
You can calculate percentages of a total in pandas with groupby() and the DataFrame.transform() method. transform() applies a function group by group and returns a result aligned with the original rows, so each value can be divided by the total of its own group. If the percentage is instead computed directly on the summarized DataFrame, the result is calculated using all the data.
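As a rough sketch of the groupby()/transform() route applied to the question's data: the column names n and pct below are hypothetical, chosen only for illustration, and the numeric columns are left out because they do not affect the counts.

```python
import pandas as pd

# Same categorical columns as in the question (numeric columns omitted).
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

# Count rows for each (A, B) pair; 'n' is a hypothetical column name.
counts = df.groupby(['A', 'B']).size().reset_index(name='n')

# transform('sum') returns, for every row, the total of that row's A group,
# so the division gives each pair's share of its row total.
counts['pct'] = counts['n'] / counts.groupby('A')['n'].transform('sum')

# Pivot back into the crosstab layout (A as rows, B as columns).
row_pct = counts.pivot(index='A', columns='B', values='pct')
print(row_pct)
```

Grouping by 'B' instead of 'A' in the transform step would give column percentages; dividing by counts['n'].sum() would give shares of the grand total.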
From Pandas 0.18.1 onwards, there's a normalize option:
    In [1]: pd.crosstab(df.A,df.B, normalize='index')
    Out[1]:
    B             A         B         C
    A
    one    0.333333  0.333333  0.333333
    three  0.333333  0.333333  0.333333
    two    0.333333  0.333333  0.333333
Where you can normalise across either all, index (rows), or columns.
More details are available in the documentation.
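To make the three settings concrete, here is a small sketch on the question's categorical columns. The variable names by_row, by_col and overall are just illustrative; multiplying by 100 at the end is one way to turn the proportions into percentages.

```python
import pandas as pd

# Same categorical columns as in the question.
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

# Each row sums to 1.
by_row = pd.crosstab(df.A, df.B, normalize='index')

# Each column sums to 1.
by_col = pd.crosstab(df.A, df.B, normalize='columns')

# The whole table sums to 1.
overall = pd.crosstab(df.A, df.B, normalize='all')

# Scale to percentages if that is the preferred display.
print((by_row * 100).round(1))
```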
    pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)
Basically you just have the function that does row/row.sum(), and you use apply with axis=1 to apply it by row.
(If doing this in Python 2, you should use from __future__ import division to make sure division always returns a float.)
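For comparison, a short sketch of the apply approach next to a vectorized equivalent. The DataFrame.div call is an alternative I am adding here, not part of the answer above; it divides every row by its own total without a Python-level lambda, and on this data the two should give the same table.

```python
import pandas as pd

# Same categorical columns as in the question.
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

ct = pd.crosstab(df.A, df.B)

# Row-wise apply, as in the answer: each row divided by its own sum.
via_apply = ct.apply(lambda r: r / r.sum(), axis=1)

# Vectorized alternative: divide along the index by the row totals.
via_div = ct.div(ct.sum(axis=1), axis=0)

print(via_apply)
```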