So I'm trying to create new data in a time series based on past data. For example I have player data here with each row being stats accumulated at a certain age. I want to create new row in the Dataframe where I increment the max age by one and then take the average of the sa and ga column from the two years before that.
Here is the data
import pandas as pd
data = [['Adam Wilcox', 8476330, 25, 14.0, 0.0],
['Adin Hill', 8478499, 21, 129.0, 14.0],
['Adin Hill', 8478499, 22, 322.0, 32.0],
['Adin Hill', 8478499, 23, 343.0, 28.0],
['Adin Hill', 8478499, 24, 530.0, 46.0],
['Adin Hill', 8478499, 25, 237.0, 26.0],
['Al Montoya', 8471219, 24, 120.0, 9.0],
['Al Montoya', 8471219, 26, 585.0, 46.0],
['Al Montoya', 8471219, 27, 832.0, 89.0],
['Al Montoya', 8471219, 28, 168.0, 17.0]]
model_df = pd.DataFrame(data,
columns=['player', 'player_id', 'season_age', 'sa', 'ga'])
For example what I would want to create is ['Al Montoya', 8471219, 29, 500, 53] (remember the last two values is the average of the sa and ga columns from ages 28 and 27).
I've accomplished this using iterrows and creating a new Dataframe and appending like this:
max_ages = model_df.groupby(['player', 'player_id'])[['season_age']].max().reset_index()
added_ages = []
for player in max_ages.iterrows():
row = [player[1][0],
player[1][1],
player[1][2] + 1,
(model_df[(model_df['player_id'] == player[1][1]) &
(model_df['season_age'] == player[1][2] - 1)]['sa'].sum() +
model_df[(model_df['player_id'] == player[1][1]) &
(model_df['season_age'] == player[1][2] - 2)]['sa'].sum())/2,
(model_df[(model_df['player_id'] == player[1][1]) &
(model_df['season_age'] == player[1][2] - 1)]['ga'].sum() +
model_df[(model_df['player_id'] == player[1][1]) &
(model_df['season_age'] == player[1][2] - 2)]['ga'].sum())/2
]
added_ages.append(row)
added_ages_df = pd.DataFrame(added_ages,
columns=['player', 'player_id', 'season_age', 'sa', 'ga'])
model_df = pd.concat([model_df, added_ages_df])
Obviously this is an adhoc solution that is very brittle, my question is if there is a built in way in pandas of doing this without using iterrows
The expected Dataframe would look like this easier to represent in list form
data = [['Adam Wilcox', 8476330, 25, 14.0, 0.0],
['Adin Hill', 8478499, 21, 129.0, 14.0],
['Adin Hill', 8478499, 22, 322.0, 32.0],
['Adin Hill', 8478499, 23, 343.0, 28.0],
['Adin Hill', 8478499, 24, 530.0, 46.0],
['Adin Hill', 8478499, 25, 237.0, 26.0],
['Adin Hill', 8478499, 26, 502, 36],
['Al Montoya', 8471219, 24, 120.0, 9.0],
['Al Montoya', 8471219, 26, 585.0, 46.0],
['Al Montoya', 8471219, 27, 832.0, 89.0],
['Al Montoya', 8471219, 28, 168.0, 17.0],
['Al Montoya', 8471219, 29, 500, 53]]
You can define a function called add_row and pass it to a groupby. I'll assume that if there aren't two years of data for a player, you'll want the columns sa and ga to be populated with NaN:
def add_row(x):
last_row = x.iloc[-1]
last_row['season_age'] = last_row['season_age']+1
if len(x) < 2:
last_row['sa'], last_row['ga'] = float("nan"), float("nan")
return x.append(last_row)
else:
last_row['sa'], last_row['ga'] = x[['sa','ga']].iloc[-2:].mean()
return x.append(last_row)
new_model_df = model_df.groupby("player").apply(add_row).reset_index(drop=True)
Output:
>>> new_model_df
player player_id season_age sa ga
0 Adam Wilcox 8476330 25 14.0 0.0
1 Adam Wilcox 8476330 26 NaN NaN
2 Adin Hill 8478499 21 129.0 14.0
3 Adin Hill 8478499 22 322.0 32.0
4 Adin Hill 8478499 23 343.0 28.0
5 Adin Hill 8478499 24 530.0 46.0
6 Adin Hill 8478499 25 237.0 26.0
7 Adin Hill 8478499 26 383.5 36.0
8 Al Montoya 8471219 24 120.0 9.0
9 Al Montoya 8471219 26 585.0 46.0
10 Al Montoya 8471219 27 832.0 89.0
11 Al Montoya 8471219 28 168.0 17.0
12 Al Montoya 8471219 29 500.0 53.0
Do some computation on the grouped object, and merge the result to the model_df:
grouper = ['player', 'player_id']
grouped = model_df.groupby(grouper, sort = False)
tail = grouped.nth(-1) # get the last row per group
tail = tail.assign(season_age = tail.season_age + 1)
# get the average of the last two columns with rolling
# a second groupby is called here to get single rows per group
sa_ga = (group[['sa', 'ga']]
.rolling(2)
.mean()
.groupby(grouper)
.nth(-1)
)
tail = tail.assign(**sa_ga).reset_index()
# final output
(pd.concat([model_df, tail])
.sort_values(grouper, ignore_index = True)
)
player player_id season_age sa ga
0 Adam Wilcox 8476330 25 14.0 0.0
1 Adam Wilcox 8476330 26 NaN NaN
2 Adin Hill 8478499 21 129.0 14.0
3 Adin Hill 8478499 22 322.0 32.0
4 Adin Hill 8478499 23 343.0 28.0
5 Adin Hill 8478499 24 530.0 46.0
6 Adin Hill 8478499 25 237.0 26.0
7 Adin Hill 8478499 26 383.5 36.0
8 Al Montoya 8471219 24 120.0 9.0
9 Al Montoya 8471219 26 585.0 46.0
10 Al Montoya 8471219 27 832.0 89.0
11 Al Montoya 8471219 28 168.0 17.0
12 Al Montoya 8471219 29 500.0 53.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With