Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas Pivot_Table : Percentage of row calculation for non-numeric values

Tags:

python

pandas

This is my DATA in dataframe "df":

Document    Name    Time
SPS2315511  A   1 HOUR
SPS2315512  B   1 - 2 HOUR
SPS2315513  C   2 - 3 HOUR
SPS2315514  C   1 HOUR
SPS2315515  B   1 HOUR
SPS2315516  A   2 - 3 HOUR
SPS2315517  A   1 - 2 HOUR

I am using the below code which gives me the summary of count in the pivot table,

table = pivot_table(df, values=["Document"],
                    index=["Name"], columns=["Time"],
                    aggfunc=lambda x: len(x),
                    margins=True, dropna=True)

but what i want is the % of row calculation as in excel pivot when you right click the pivot and select "show value as -> % of Row Total" . Since my Document is a non-numeric value i was not able to get it.

EXPECTED RESULT :

Count of Document   Column Labels

Name    1 HOUR  1 - 2 HOUR  2 - 3 HOUR  Grand Total
A   33.33%  33.33%  33.33%  100.00%
B   50.00%  50.00%  0.00%   100.00%
C   50.00%  0.00%   50.00%  100.00%
Grand Total 42.86%  28.57%  28.57%  100.00%

Can any one please help me figure out a way to get this result??

i am trying to manipulate the pivot data which will give me the row total,not the data from the dataframe and what i wanted is "% of row total". And also most importantly all my data are non-numeric values...

like image 881
keerthi kumar Avatar asked Jun 26 '15 04:06

keerthi kumar


2 Answers

The possible duplicate noted by @maxymoo is pretty close to a solution, but I'll go ahead and write it up as an answer since there are a couple of differences that are not completely straightforward.

table = pd.pivot_table(df, values=["Document"],
                       index=["Name"], columns=["Time"], 
                       aggfunc=len, margins=True, 
                       dropna=True, fill_value=0)

       Document                      
Time 1 - 2 HOUR 1 HOUR 2 - 3 HOUR All
Name                                 
A             1      1          1   3
B             1      1          0   2
C             0      1          1   2
All           2      3          2   7

The main tweak there is to add fill_value=0 because what you really want there is a count value of zero, not a NaN.

Then you can basically use the solution @maxymoo linked to, but you need to use iloc or similar b/c the table columns are a little complicated now (being a multi-indexed result of the pivot table).

table2 = table.div( table.iloc[:,-1], axis=0 )

       Document                         
Time 1 - 2 HOUR    1 HOUR 2 - 3 HOUR All
Name                                    
A      0.333333  0.333333   0.333333   1
B      0.500000  0.500000   0.000000   1
C      0.000000  0.500000   0.500000   1
All    0.285714  0.428571   0.285714   1

You've still got some minor formatting work to do there (flip first and second columns and convert to %), but those are the numbers you are looking for.

Btw, it's not necessary here, but you might want to think about converting 'Time' to an ordered categorical variable, which would be one way to solve the column ordering problem (I think), but may or may not be worth the bother depending on what else you are doing with the data.

like image 60
JohnE Avatar answered Sep 29 '22 12:09

JohnE


you can use something like this

df = pd.DataFrame({'Document':['SPS2315511','SPS2315512','SPS2315513','SPS2315514','SPS2315515','SPS2315516','SPS2315517'],
                  'Name':['A','B','C','C','B','A','A'],
                  'Time': ['1 HOUR','1 - 2 HOUR','2 - 3 HOUR','1 HOUR','1 HOUR','2 - 3 HOUR','1 - 2 HOUR']})
pd.crosstab(index= df.Name, columns= df.Time, values= df.Document, aggfunc = np.count_nonzero
            ,margins=True,margins_name='Total',normalize= 'index')
like image 21
sushmit Avatar answered Sep 29 '22 12:09

sushmit