
How to make a pandas crosstab with percentages?

Given a dataframe with different categorical variables, how do I return a cross-tabulation with percentages instead of frequencies?

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                   'D': np.random.randn(24),
                   'E': np.random.randn(24)})

pd.crosstab(df.A, df.B)

B       A    B    C
A
one     4    4    4
three   2    2    2
two     2    2    2

Using the margins option in crosstab to compute row and column totals gets us close enough to think that it should be possible using an aggfunc or groupby, but my meager brain can't think it through. The output I'm after is:

B       A     B     C
A
one     .33   .33   .33
three   .33   .33   .33
two     .33   .33   .33
Brian Keegan asked Jan 21 '14 00:01




2 Answers

From Pandas 0.18.1 onwards, there's a normalize option:

In [1]: pd.crosstab(df.A, df.B, normalize='index')
Out[1]:
B             A         B         C
A
one    0.333333  0.333333  0.333333
three  0.333333  0.333333  0.333333
two    0.333333  0.333333  0.333333

You can normalise over all values (normalize='all'), over each row (normalize='index'), or over each column (normalize='columns').

More details are available in the pandas.crosstab documentation.
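
To make the three modes concrete, here is a small sketch that rebuilds the question's two categorical columns and applies each normalize setting. The variable names (by_row, by_col, overall, row_pct) are only illustrative, and the scaling by 100 is an extra step added here, since normalize returns fractions rather than percentages.

```python
import pandas as pd

# Same categorical columns as in the question
# (the random columns D and E are not needed for a crosstab of A against B).
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

by_row = pd.crosstab(df.A, df.B, normalize='index')    # each row sums to 1
by_col = pd.crosstab(df.A, df.B, normalize='columns')  # each column sums to 1
overall = pd.crosstab(df.A, df.B, normalize='all')     # the whole table sums to 1

# normalize gives fractions; multiply by 100 if you want percentages.
row_pct = (by_row * 100).round(1)
print(row_pct)
```

With this data every row splits evenly across B, so row_pct should show 33.3 in every cell, while by_col gives 'one' a larger share than 'two' or 'three' because it appears twice as often in A.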

Harry answered Sep 21 '22 17:09


pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1) 

Basically, the lambda divides each row by its own sum (row/row.sum()), and apply with axis=1 runs it once per row.

(If doing this in Python 2, you should use from __future__ import division to make sure division always returns a float.)
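
As a variation on the apply approach, the same row shares can be computed without a per-row lambda by dividing the whole table by its row totals with DataFrame.div. This is a sketch of an alternative, not part of the answer above; the names counts, via_apply, via_div and col_share are illustrative.

```python
import pandas as pd

# Same categorical columns as in the question.
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

counts = pd.crosstab(df.A, df.B)

# Row-wise apply, as in the answer above.
via_apply = counts.apply(lambda r: r / r.sum(), axis=1)

# Vectorized equivalent: align the row totals on the index (axis=0)
# so that each row is divided by its own total.
via_div = counts.div(counts.sum(axis=1), axis=0)

# Column shares instead: divide each column by its column total.
col_share = counts.div(counts.sum(axis=0), axis=1)

print(via_div)
```

Both via_apply and via_div should hold the same values; the div form just avoids calling a Python function for every row, which may matter on large tables.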

BrenBarn answered Sep 24 '22 17:09