Calculate percentage of similar values in pandas dataframe

I have a dataframe df with two columns: Script (text) and Speaker.

Script  Speaker
aze     Speaker 1 
art     Speaker 2
ghb     Speaker 3
jka     Speaker 1
tyc     Speaker 1
avv     Speaker 2 
bhj     Speaker 1

I also have the following list: L = ['a','b','c']

With the following code,

df2 = (df.set_index('Speaker')['Script'].str.findall('|'.join(L))
         .str.join('|')
         .str.get_dummies()
         .groupby(level=0).sum())  # sum(level=0) was removed in pandas 2.0
print(df2)

I obtain this dataframe df2 :

Speaker     a    b    c
Speaker 1   2    1    1
Speaker 2   2    0    0
Speaker 3   0    1    0

What can I add to my code so that each row of df2 is expressed as a percentage of that speaker's total, giving the following dataframe df3:

Speaker     a     b     c
Speaker 1   50%   25%   25%
Speaker 2   100%  0     0
Speaker 3   0     100%  0
Alex Dana asked Dec 27 '19 15:12


1 Answer

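You could divide each row by its row total (the sum along axis 1), then cast to string and add %:

out = (df.set_index('Speaker')['Script'].str.findall('|'.join(L))
         .str.join('|')
         .str.get_dummies()
         .groupby(level=0).sum())  # sum(level=0) was removed in pandas 2.0

# divide each row by its own total so that every row adds up to 100
out.div(out.sum(axis=1), axis=0).mul(100).astype(int).astype(str).add('%')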

             a     b    c
Speaker                   
Speaker 1   50%   25%  25%
Speaker 2  100%    0%   0%
Speaker 3    0%  100%   0%
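
For anyone who wants to try this end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the row-wise division. The names counts, row_share and percent are only illustrative, and the code assumes a reasonably recent pandas (groupby(level=0) in place of the older sum(level=0)).

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Script": ["aze", "art", "ghb", "jka", "tyc", "avv", "bhj"],
    "Speaker": ["Speaker 1", "Speaker 2", "Speaker 3", "Speaker 1",
                "Speaker 1", "Speaker 2", "Speaker 1"],
})
L = ["a", "b", "c"]

# One indicator column per letter in L, summed per speaker
counts = (df.set_index("Speaker")["Script"]
            .str.findall("|".join(L))
            .str.join("|")
            .str.get_dummies()
            .groupby(level=0).sum())

# Divide each row by its own total (axis=0 aligns the totals on the index)
row_share = counts.div(counts.sum(axis=1), axis=0)

# Scale to 0-100, round to whole numbers and append the percent sign
percent = row_share.mul(100).round().astype(int).astype(str).add("%")
print(percent)
```

Note that a speaker with no matching letters at all would have a row total of 0, so the division would produce NaN and the cast to int would fail. That case does not occur in the sample data, but it may need something like fillna(0) before the cast on real data.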
yatu answered Sep 23 '22 05:09