This is my DATA in dataframe "df":
Document Name Time
SPS2315511 A 1 HOUR
SPS2315512 B 1 - 2 HOUR
SPS2315513 C 2 - 3 HOUR
SPS2315514 C 1 HOUR
SPS2315515 B 1 HOUR
SPS2315516 A 2 - 3 HOUR
SPS2315517 A 1 - 2 HOUR
I am using the below code which gives me the summary of count in the pivot table,
table = pivot_table(df, values=["Document"],
index=["Name"], columns=["Time"],
aggfunc=lambda x: len(x),
margins=True, dropna=True)
but what i want is the % of row calculation as in excel pivot when you right click the pivot and select "show value as -> % of Row Total" . Since my Document is a non-numeric value i was not able to get it.
Count of Document Column Labels
Name 1 HOUR 1 - 2 HOUR 2 - 3 HOUR Grand Total
A 33.33% 33.33% 33.33% 100.00%
B 50.00% 50.00% 0.00% 100.00%
C 50.00% 0.00% 50.00% 100.00%
Grand Total 42.86% 28.57% 28.57% 100.00%
Can any one please help me figure out a way to get this result??
i am trying to manipulate the pivot data which will give me the row total,not the data from the dataframe and what i wanted is "% of row total". And also most importantly all my data are non-numeric values...
The possible duplicate noted by @maxymoo is pretty close to a solution, but I'll go ahead and write it up as an answer since there are a couple of differences that are not completely straightforward.
table = pd.pivot_table(df, values=["Document"],
index=["Name"], columns=["Time"],
aggfunc=len, margins=True,
dropna=True, fill_value=0)
Document
Time 1 - 2 HOUR 1 HOUR 2 - 3 HOUR All
Name
A 1 1 1 3
B 1 1 0 2
C 0 1 1 2
All 2 3 2 7
The main tweak there is to add fill_value=0
because what you really want there is a count value of zero, not a NaN.
Then you can basically use the solution @maxymoo linked to, but you need to use iloc
or similar b/c the table columns are a little complicated now (being a multi-indexed result of the pivot table).
table2 = table.div( table.iloc[:,-1], axis=0 )
Document
Time 1 - 2 HOUR 1 HOUR 2 - 3 HOUR All
Name
A 0.333333 0.333333 0.333333 1
B 0.500000 0.500000 0.000000 1
C 0.000000 0.500000 0.500000 1
All 0.285714 0.428571 0.285714 1
You've still got some minor formatting work to do there (flip first and second columns and convert to %), but those are the numbers you are looking for.
Btw, it's not necessary here, but you might want to think about converting 'Time' to an ordered categorical variable, which would be one way to solve the column ordering problem (I think), but may or may not be worth the bother depending on what else you are doing with the data.
you can use something like this
df = pd.DataFrame({'Document':['SPS2315511','SPS2315512','SPS2315513','SPS2315514','SPS2315515','SPS2315516','SPS2315517'],
'Name':['A','B','C','C','B','A','A'],
'Time': ['1 HOUR','1 - 2 HOUR','2 - 3 HOUR','1 HOUR','1 HOUR','2 - 3 HOUR','1 - 2 HOUR']})
pd.crosstab(index= df.Name, columns= df.Time, values= df.Document, aggfunc = np.count_nonzero
,margins=True,margins_name='Total',normalize= 'index')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With