
How to calculate ratio of values in a pandas dataframe column?

Tags:

python

pandas

I'm new to pandas and decided to learn it by playing around with some data I pulled from my favorite game's API. I have a dataframe with two columns, "playerId" and "winner", like so:

playerStatus:
______________________
   playerId   winner
0    1848      True
1    1988      False
2    3543      True
3    1848      False
4    1988      False
...

Each row represents a match the player participated in. My goal is to either transform this dataframe or create a new one such that the win percentage for each playerId is calculated. For example, the above dataframe would become:

playerWinsAndTotals
_________________________________________
   playerId   wins  totalPlayed   winPct
0    1848      1        2         50.0000
1    1988      0        2         0.0000
2    3543      1        1         100.0000
...

It took quite a while of reading pandas docs, but I actually managed to achieve this by essentially creating two different tables (one to find the number of wins for each player, one to find the total games for each player), and merging them, then taking the ratio of wins to games played.

Creating the "wins" dataframe:

temp_df = playerStatus[['playerId', 'winner']].value_counts().reset_index(name='wins')
onlyWins = temp_df[temp_df['winner'] == True][['playerId', 'wins']]
onlyWins
_________________________
    playerId    wins
1     1670       483
3     1748       474
4     2179       468
6     4006       434
8     1668       392
...

Creating the "totals" dataframe:

totalPlayed = playerStatus['playerId'].value_counts().reset_index(name='totalCount').rename(columns={'index': 'playerId'})
totalPlayed
____________________

   playerId   totalCount
0    1670        961
1    1748        919
2    1872        877
3    4006        839
4    2179        837
...

Finally, merging them and adding the "winPct" column.

playerWinsAndTotals = onlyWins.merge(totalPlayed, on='playerId', how='left')
playerWinsAndTotals['winPct'] = playerWinsAndTotals['wins']/playerWinsAndTotals['totalCount'] * 100
playerWinsAndTotals
_____________________________________________

   playerId   wins   totalCount     winPct
0    1670      483      961       50.260146
1    1748      474      919       51.577802
2    2179      468      837       55.913978
3    4006      434      839       51.728248
4    1668      392      712       55.056180
...

Now, the reason I am posting this here is that I know I'm not taking full advantage of what pandas has to offer. Creating and merging two different dataframes just to find the ratio of player wins seems unnecessary. I feel like I took the "scenic" route on this one.

To anyone more experienced than me, how would you tackle this problem?

Bankst asked Aug 16 '21 01:08




1 Answer

We can take advantage of the way that Boolean values are handled mathematically (True being 1 and False being 0) and apply three aggregation functions (sum, count and mean) per group with groupby aggregate. Named Aggregation also lets us create and name the output columns in one step:

df = (
    df.groupby('playerId', as_index=False)
        .agg(wins=('winner', 'sum'),
             totalCount=('winner', 'count'),
             winPct=('winner', 'mean'))
)
# Convert the win fraction (0 to 1) into a percentage
df['winPct'] *= 100

df:

   playerId  wins  totalCount  winPct
0      1848     1           2    50.0
1      1988     0           2     0.0
2      3543     1           1   100.0
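
The reason mean gives the win rate directly is that True counts as 1 and False as 0 when a Boolean column is aggregated. A minimal sketch on a throwaway Series (the name flags and its values are made up for illustration, not taken from the question's data):

```python
import pandas as pd

# Hypothetical Boolean column: two wins out of five matches.
flags = pd.Series([True, False, True, False, False])

wins = int(flags.sum())              # True counts as 1, False as 0
total = int(flags.count())           # number of non-null entries
win_pct = float(flags.mean()) * 100  # mean is sum / count, scaled to a percentage

print(wins, total, win_pct)
```

Since mean is just sum divided by count here, the winPct column could equally be computed from the other two columns; using mean simply avoids the extra division.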

DataFrame and imports:

import pandas as pd

df = pd.DataFrame({
    'playerId': [1848, 1988, 3543, 1848, 1988],
    'winner': [True, False, True, False, False]
})
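
One behavioural difference from a merge-based route is worth checking: if the wins table is built only from rows where winner is True and the totals are left-joined onto it, players with no wins drop out entirely, whereas the single groupby keeps them with wins equal to 0. The sketch below is a simplified stand-in for the question's approach, not a copy of it; the names only_wins, totals, merged and grouped are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'playerId': [1848, 1988, 3543, 1848, 1988],
    'winner': [True, False, True, False, False],
})

# Merge-style route: count winning rows only, then join the totals onto them.
only_wins = (
    df[df['winner']]
    .groupby('playerId', as_index=False)
    .agg(wins=('winner', 'size'))
)
totals = df.groupby('playerId', as_index=False).agg(totalCount=('winner', 'size'))
merged = only_wins.merge(totals, on='playerId', how='left')

# Single groupby: every player appears, including those with zero wins.
grouped = df.groupby('playerId', as_index=False).agg(
    wins=('winner', 'sum'),
    totalCount=('winner', 'count'),
)

print(merged)
print(grouped)
```

With this sample data, player 1988 (no wins) is absent from merged but present in grouped.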
Henry Ecker answered Oct 26 '22 22:10
