I am a primarily JS developer who is trying to pick up pandas and perform some data analysis. Part of this analysis includes converting a team's match performances (win/loss) to a numeric rating (based off win percentage).
TLDR: I'm trying to get from DF 1 to DF 3.
DF 1
| season | opponent | outcome |
-------------------------------------
| 2020 | A | w |
| 2020 | A | l |
| 2020 | B | w |
| 2020 | B | w |
| 2020 | C | l |
| 2020 | C | l |
| 2021 | A | w |
| 2021 | A | w |
| 2021 | B | w |
| 2021 | B | l |
| 2021 | C | w |
| 2021 | C | w |
I need to calculate the winning percentage, grouped by season and opponent.
DF 2
| season | opponent | win % |
-------------------------------------
| 2020 | A | 50 |
| 2020 | B | 100 |
| 2020 | C | 0 |
| 2021 | A | 100 |
| 2021 | B | 50 |
| 2021 | C | 100 |
After that, we need to calculate the rating per season. This is done by averaging the win % across teams in the same season, but with the caveat that the win % agains team A is worth twice that of other teams. This is just arbitrary formula, the actual calculation is more complex (different opponents have different weights - I need a way to pass that as part of a custom Lambda function or something) but I'm trying to simplify things for this question.
DF 3
| season | rating |
-------------------------
| 2020 | 50.0 |
| 2021 | 87.5 |
Rating calculations example: Season 2020 rating = team A % * 2 + team B win % + team C win % / (total no of teams + 1) = (50% * 2 + 100% + 0%) / (3 + 1) = 50.0
How can we get from the first to the last dataframe using pandas? I am able to get to a version of DF 2 by using the following
df2 = df1.groupby(["season", "opponent"])["outcome"].value_counts(normalize = True).to_frame()
This frame includes percentages for losses which are not needed, but it should not matter if I am able to filter/drop that as part of the "transformation" to DF 3.
I have been trying to do things like df2 = df2[df2["outcome"] != "w"], or df2 = df2.query('outcome != "w"') to drop the additional rows with the loss condition based on answers to another question, but to no avail. I suspect this is because outcome is a nested column. Have also noticed this question but what I think I need is a "wildcard" to access the nested outcome column regardless of opponent.
Note: if there are more efficient ways to get from DF 1 to DF 3 directly (this seems close but not quite), I'm happy to explore those methods too.
You can get df2 as follows:
df2 = (df1.groupby(["season", "opponent"])["outcome"]
.value_counts(normalize=True)
.unstack(fill_value=0).stack(dropna=False)
.mul(100)
.reset_index(name='win %')
.query('outcome == "w"')
).reset_index(drop=True)
Result
print(df2)
season opponent outcome win %
0 2020 A w 50.0
1 2020 B w 100.0
2 2020 C w 0.0
3 2021 A w 100.0
4 2021 B w 50.0
5 2021 C w 100.0
Then, to use the formula to get df3, you can use:
df2a = df2.set_index('season')
# Get: (team A % * 2 + team B win % + team C win %)
df3_x = (df2a.loc[df2a['opponent'] =='A', 'win %'] * 2
+ df2a.loc[df2a['opponent'] =='B', 'win %']
+ df2a.loc[df2a['opponent'] =='C', 'win %']
)
# Get (total no of teams + 1) for a particular season
df3_y = df2.groupby('season')['opponent'].count() + 1
df3 = (df3_x / df3_y).reset_index(name='rating')
Result
print(df3)
season rating
0 2020 50.0
1 2021 87.5
For your reference, here is the interim results during derivation of df3:
# team A % * 2 + team B win % + team C win %
print(df3_x)
season
2020 200.0
2021 350.0
Name: win %, dtype: float64
# (total no of teams + 1) for a particular season
print(df3_y)
season
2020 4
2021 4
Name: opponent, dtype: int64
import pandas as pd
df_test = pd.DataFrame(data={'season':[2020]*6 + [2021]*6, 'opponent': ['A', 'A', 'B', 'B', 'C', 'C']*2,
'outcome': ['w', 'l', 'w', 'w', 'l', 'l', 'w', 'w', 'w', 'l', 'w', 'w']})
df_weightage = pd.DataFrame(data={'season':[2020]*3 + [2021]*3, 'opponent': ['A', 'B', 'C']*2,
'weightage': [0.2, 0.3, 0.5, 0.1, 0.2, 0.7]})
print(df_test)
print('='*30)
print(df_weightage)
print('='*35)
def get_pct(data):
return len(data[data == 'w'])/len(data)
def get_rating(data):
return sum(data['win_percentage']*data['weightage'])/len(data)
df_test = df_test.groupby(["season", "opponent"])["outcome"].apply(get_pct).rename('win_percentage').reset_index()
print(df_test)
print('='*45)
df_test = df_test.merge(df_weightage, how= 'left', on=['season', 'opponent'])
print(df_test)
print('='*45)
df_ratings = df_test.groupby(['season'])[['win_percentage', 'weightage']].apply(get_rating).rename('ratings').reset_index()
print(df_ratings)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With