Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I calculate a grouped and weighted aggregate of a column in pandas?

I am a primarily JS developer who is trying to pick up pandas and perform some data analysis. Part of this analysis includes converting a team's match performances (win/loss) to a numeric rating (based off win percentage).

TLDR: I'm trying to get from DF 1 to DF 3.

DF 1

|   season  | opponent  |   outcome |
-------------------------------------
|   2020    |   A       |   w       |
|   2020    |   A       |   l       |
|   2020    |   B       |   w       |
|   2020    |   B       |   w       |
|   2020    |   C       |   l       |
|   2020    |   C       |   l       |
|   2021    |   A       |   w       |
|   2021    |   A       |   w       |
|   2021    |   B       |   w       |
|   2021    |   B       |   l       |
|   2021    |   C       |   w       |
|   2021    |   C       |   w       |

I need to calculate the winning percentage, grouped by season and opponent.

DF 2

|   season  | opponent  |  win %    |
-------------------------------------
|   2020    |   A       |   50      |
|   2020    |   B       |   100     |
|   2020    |   C       |   0       |
|   2021    |   A       |   100     |
|   2021    |   B       |   50      |
|   2021    |   C       |   100     |

After that, we need to calculate the rating per season. This is done by averaging the win % across teams in the same season, but with the caveat that the win % agains team A is worth twice that of other teams. This is just arbitrary formula, the actual calculation is more complex (different opponents have different weights - I need a way to pass that as part of a custom Lambda function or something) but I'm trying to simplify things for this question.

DF 3

|   season  |   rating  |
-------------------------
|   2020    |   50.0    |
|   2021    |   87.5    |

Rating calculations example: Season 2020 rating = team A % * 2 + team B win % + team C win % / (total no of teams + 1) = (50% * 2 + 100% + 0%) / (3 + 1) = 50.0

How can we get from the first to the last dataframe using pandas? I am able to get to a version of DF 2 by using the following

df2 = df1.groupby(["season", "opponent"])["outcome"].value_counts(normalize = True).to_frame()

This frame includes percentages for losses which are not needed, but it should not matter if I am able to filter/drop that as part of the "transformation" to DF 3.

I have been trying to do things like df2 = df2[df2["outcome"] != "w"], or df2 = df2.query('outcome != "w"') to drop the additional rows with the loss condition based on answers to another question, but to no avail. I suspect this is because outcome is a nested column. Have also noticed this question but what I think I need is a "wildcard" to access the nested outcome column regardless of opponent.

Note: if there are more efficient ways to get from DF 1 to DF 3 directly (this seems close but not quite), I'm happy to explore those methods too.

like image 523
Alvin Teh Avatar asked Dec 18 '25 14:12

Alvin Teh


2 Answers

You can get df2 as follows:

df2 = (df1.groupby(["season", "opponent"])["outcome"]
          .value_counts(normalize=True)
          .unstack(fill_value=0).stack(dropna=False)
          .mul(100)
          .reset_index(name='win %')
          .query('outcome == "w"')
      ).reset_index(drop=True)

Result

print(df2)

   season opponent outcome  win %
0    2020        A       w   50.0
1    2020        B       w  100.0
2    2020        C       w    0.0
3    2021        A       w  100.0
4    2021        B       w   50.0
5    2021        C       w  100.0

Then, to use the formula to get df3, you can use:

df2a = df2.set_index('season')

# Get: (team A % * 2 + team B win % + team C win %)
df3_x = (df2a.loc[df2a['opponent'] =='A', 'win %'] * 2 
             + df2a.loc[df2a['opponent'] =='B', 'win %'] 
             + df2a.loc[df2a['opponent'] =='C', 'win %']
        )

# Get (total no of teams + 1) for a particular season
df3_y = df2.groupby('season')['opponent'].count() + 1

df3 = (df3_x / df3_y).reset_index(name='rating')

Result

print(df3)

   season  rating
0    2020    50.0
1    2021    87.5

For your reference, here is the interim results during derivation of df3:

# team A % * 2 + team B win % + team C win % 
print(df3_x)

season
2020    200.0
2021    350.0
Name: win %, dtype: float64
# (total no of teams + 1) for a particular season
print(df3_y)

season
2020    4
2021    4
Name: opponent, dtype: int64
like image 95
SeaBean Avatar answered Dec 20 '25 03:12

SeaBean


import pandas as pd

df_test = pd.DataFrame(data={'season':[2020]*6 + [2021]*6, 'opponent': ['A', 'A', 'B', 'B', 'C', 'C']*2,
                        'outcome': ['w', 'l', 'w', 'w', 'l', 'l', 'w', 'w', 'w', 'l', 'w', 'w']})

df_weightage = pd.DataFrame(data={'season':[2020]*3 + [2021]*3, 'opponent': ['A', 'B', 'C']*2,
                        'weightage': [0.2, 0.3, 0.5, 0.1, 0.2, 0.7]})

print(df_test)
print('='*30)
print(df_weightage)
print('='*35)

def get_pct(data):
    return len(data[data == 'w'])/len(data)

def get_rating(data):
    return sum(data['win_percentage']*data['weightage'])/len(data)

df_test = df_test.groupby(["season", "opponent"])["outcome"].apply(get_pct).rename('win_percentage').reset_index()
print(df_test)
print('='*45)

df_test = df_test.merge(df_weightage, how= 'left', on=['season', 'opponent'])
print(df_test)
print('='*45)

df_ratings = df_test.groupby(['season'])[['win_percentage', 'weightage']].apply(get_rating).rename('ratings').reset_index()
print(df_ratings)
like image 22
Muhammad Hassan Avatar answered Dec 20 '25 03:12

Muhammad Hassan



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!