I have the following dataframe:
Company_ID Year Metric_1 Metric_2 Bankrupt
1 2010 10 20 0.0
1 2011 NaN 30 0.0
1 2012 30 40 0.0
1 2013 50 NaN 1.0
2 2012 50 60 0.0
2 2013 60 NaN 0.0
2 2014 10 10 0.0
3 2011 100 100 1.0
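For anyone who wants to reproduce this, here is a minimal sketch that builds the sample frame above. It assumes the metric columns are numeric and that the missing values are real NaN (not the string "NaN"); the variable name df is just what the answers below happen to use.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above; np.nan marks the missing values
df = pd.DataFrame({
    "Company_ID": [1, 1, 1, 1, 2, 2, 2, 3],
    "Year": [2010, 2011, 2012, 2013, 2012, 2013, 2014, 2011],
    "Metric_1": [10, np.nan, 30, 50, 50, 60, 10, 100],
    "Metric_2": [20, 30, 40, np.nan, 60, np.nan, 10, 100],
    "Bankrupt": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
})
print(df)
```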
For each company, I would like to average every metric over all years except the last one. The average should use only the values that are present and ignore missing ones (NaN). The Bankrupt column should not be averaged.
So output should look something like this:
Company_ID Year Metric_1 Metric_2 Bankrupt
1 2010-2012 20 30 0.0
1 2013 50 NaN 1.0
2 2012-2013 55 60 0.0
2 2014 10 10 0.0
3 2011 100 100 1.0
Thank you for your help.
This approach is similar to @Stef's method, but I'm leaving it here because it works with any number of Metric columns (as long as their names contain Metric). If you end up using this solution, please accept their answer instead.
You can do it like this:
# mask that is True on the last year of each company
m = df.groupby('Company_ID')['Year'].transform('max').eq(df['Year'])
# create groups per company without the last year
gr = df[~m].groupby('Company_ID')
# aggregate the non-metric columns and average the Metric columns (mean skips NaN by default)
agg_part = pd.concat([gr.agg(Bankrupt=('Bankrupt', 'first'),  # here I'm not sure which value you want
                             Year=('Year', lambda x: f'{x.min()}-{x.max()}')),
                      gr[df.filter(like='Metric').columns].mean()],
                     axis=1).reset_index()  # bring Company_ID back as a column
# add the last year back; DataFrame.append was removed in pandas 2.0, so use pd.concat
df_ = (pd.concat([agg_part, df[m]])
         .sort_values(['Company_ID'], kind='stable')  # stable sort keeps the averaged row first
         .reset_index(drop=True)
      )
print(df_)
Company_ID Bankrupt Year Metric_1 Metric_2
0 1 0.0 2010-2012 20.0 30.0
1 1 1.0 2013 50.0 NaN
2 2 0.0 2012-2013 55.0 60.0
3 2 0.0 2014 10.0 10.0
4 3 1.0 2011 100.0 100.0
Here is another version that avoids the extra steps of adding the last year back and sorting. It puts the mask into the groupby and uses a different lambda function for the Year column:
#mask for catching last year per Company
m = df.groupby(['Company_ID'])['Year'].transform('max').eq(df['Year']) #same
# create groups per company without the last year
gr = df.groupby([df['Company_ID'], m]) #m is in the groupby and not as mask
df_ = (pd.concat([gr.agg(Company_ID=('Company_ID', 'first'),
Bankrupt=('Bankrupt', 'first'),
Year=('Year', lambda x: f'{x.min()}-{x.max()}' if x.min()!=x.max()
else x.max())), #different lambda function
gr[df.filter(like='Metric').columns].mean()],
axis=1)
#no need to add the last year back or call sort_values
.reset_index(drop=True)
)
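As a self-contained sketch of the same idea (grouping by company together with the last-year mask), the version below builds the aggregation spec from whatever Metric columns exist. The helper name year_label, the key name is_last, and the dict-style agg are my own choices for illustration, not part of the code above, and Bankrupt is taken with 'first' only as an assumption about what is wanted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Company_ID": [1, 1, 1, 1, 2, 2, 2, 3],
    "Year": [2010, 2011, 2012, 2013, 2012, 2013, 2014, 2011],
    "Metric_1": [10, np.nan, 30, 50, 50, 60, 10, 100],
    "Metric_2": [20, 30, 40, np.nan, 60, np.nan, 10, 100],
    "Bankrupt": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
})

def year_label(years):
    """Return 'min-max' for a span of years, or the single year as a string."""
    lo, hi = years.min(), years.max()
    return str(hi) if lo == hi else f"{lo}-{hi}"

# True on the last year of each company; used as a second grouping key
is_last = (df.groupby("Company_ID")["Year"].transform("max")
             .eq(df["Year"])
             .rename("is_last"))

# Every column whose name contains 'Metric' is averaged (mean skips NaN)
metric_cols = list(df.filter(like="Metric").columns)
spec = {"Year": year_label, "Bankrupt": "first",
        **{col: "mean" for col in metric_cols}}

out = (df.groupby(["Company_ID", is_last])
         .agg(spec)
         .reset_index()
         .drop(columns="is_last"))
print(out)
```

Because the group keys are sorted, the averaged row of each company comes before its last-year row without an explicit sort. Note that Year ends up as strings in every row here, including the single-year ones.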
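# rows to average: not bankrupt and not the last year of the company
m = df.Bankrupt.eq(0) & df.groupby('Company_ID').Year.transform(lambda x: x != x.max())
avg = df[m].groupby(['Company_ID','Bankrupt']).agg(Year=('Year', lambda x: f'{x.min()}-{x.max()}'),
                                                   Metric_1=('Metric_1', 'mean'),
                                                   Metric_2=('Metric_2', 'mean')).reset_index()
# DataFrame.append was removed in pandas 2.0, so use pd.concat to add the remaining rows
pd.concat([avg, df[~m]]).sort_values('Company_ID', kind='stable')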
Result:
Company_ID Bankrupt Year Metric_1 Metric_2
0 1 0.0 2010-2012 20.0 30.0
3 1 1.0 2013 50.0 NaN
1 2 0.0 2012-2013 55.0 60.0
6 2 0.0 2014 10.0 10.0
7 3 1.0 2011 100.0 100.0