Group and average dataframe rows based on a condition [closed]

Question

I have the following dataframe:

Company_ID  Year   Metric_1  Metric_2  Bankrupt
1           2010   10        20        0.0
1           2011   NaN       30        0.0
1           2012   30        40        0.0
1           2013   50        NaN       1.0
2           2012   50        60        0.0
2           2013   60        NaN       0.0
2           2014   10        10        0.0
3           2011   100       100       1.0
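
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is a minimal sketch: the dtypes are an assumption (the metric columns are floats so that they can hold NaN), and only the values come from the table.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above; np.nan marks the missing metrics.
df = pd.DataFrame({
    'Company_ID': [1, 1, 1, 1, 2, 2, 2, 3],
    'Year': [2010, 2011, 2012, 2013, 2012, 2013, 2014, 2011],
    'Metric_1': [10, np.nan, 30, 50, 50, 60, 10, 100],
    'Metric_2': [20, 30, 40, np.nan, 60, np.nan, 10, 100],
    'Bankrupt': [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
})
print(df)
```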

What I would like to do is, for each company, average every metric over all years except the last one. The average should use only the values that are present and ignore missing values (NaN). The Bankrupt column should not be averaged.

So output should look something like this:

Company_ID  Year        Metric_1  Metric_2  Bankrupt
1           2010-2012   20        30        0.0
1           2013        50        NaN       1.0
2           2012-2013   55        60        0.0
2           2014        10        10        0.0
3           2011        100       100       1.0

Thank you for your help.

Ben.T · Accepted Answer

This is similar to @Stef's method, but I am leaving it here because it works for any number of Metric columns (as long as their names contain Metric). If you end up using this solution, please accept their answer instead.

You can do it like this:

import pandas as pd

# mask that is True on the last year of each company
m = df.groupby('Company_ID')['Year'].transform('max').eq(df['Year'])
# create groups per company without the last year
gr = df[~m].groupby('Company_ID')

avg = pd.concat([gr.agg(Bankrupt=('Bankrupt', 'first'),  # not sure which value you want here
                        Year=('Year', lambda x: f'{x.min()}-{x.max()}')),
                 gr[list(df.filter(like='Metric').columns)].mean()],  # mean skips NaN
                axis=1).reset_index()

# DataFrame.append was removed in pandas 2.0, so add the last year back with pd.concat
df_ = (pd.concat([avg, df[m]])
         .sort_values('Company_ID', kind='stable')  # stable keeps the average before the last year
         .reset_index(drop=True)
      )
print(df_)
   Company_ID  Bankrupt       Year  Metric_1  Metric_2
0           1       0.0  2010-2012      20.0      30.0
1           1       1.0       2013      50.0       NaN
2           2       0.0  2012-2013      55.0      60.0
3           2       0.0       2014      10.0      10.0
4           3       1.0       2011     100.0     100.0
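
To make the mask step easier to follow, here is a small self-contained sketch of what it computes on the question's data. The names last_year and is_last are only illustrative, and the frame is rebuilt by hand from the question's table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Company_ID': [1, 1, 1, 1, 2, 2, 2, 3],
    'Year': [2010, 2011, 2012, 2013, 2012, 2013, 2014, 2011],
    'Metric_1': [10, np.nan, 30, 50, 50, 60, 10, 100],
    'Metric_2': [20, 30, 40, np.nan, 60, np.nan, 10, 100],
    'Bankrupt': [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
})

# transform('max') broadcasts each company's latest year back onto every row
last_year = df.groupby('Company_ID')['Year'].transform('max')
# True only on the row holding a company's final year
m = last_year.eq(df['Year'])
print(df.assign(last_year=last_year, is_last=m))

# mean() skips NaN by default, so only the present values are averaged
means = df[~m].groupby('Company_ID')[['Metric_1', 'Metric_2']].mean()
print(means)
```

Company 3 has a single row, which is also its last year, so it does not appear in means at all and only comes back when the last-year rows are added again.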

To avoid the extra step of adding the last year back and sorting, another version puts the mask in the groupby and uses a different lambda function for the Year column:

# mask that is True on the last year of each company (same as before, only renamed)
m = df.groupby('Company_ID')['Year'].transform('max').eq(df['Year']).rename('is_last')
# group per company and per mask value, so the last year forms its own group
gr = df.groupby(['Company_ID', m])  # m is in the groupby and not used as a mask

df_ = (pd.concat([gr.agg(Bankrupt=('Bankrupt', 'first'),
                         Year=('Year', lambda x: f'{x.min()}-{x.max()}' if x.min()!=x.max()
                                                 else x.max())),  # different lambda function
                  gr[list(df.filter(like='Metric').columns)].mean()],
                 axis=1)
         # no more append/sort_values: the groups already come out in order
         .reset_index()
         .drop(columns='is_last')
      )
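
One thing to be aware of with this version: the lambda returns a string for a range of years and an integer for a single year, so the Year column ends up holding mixed types. If that matters, a variant that always returns a string could look like the sketch below. The helper year_label and the name is_last are hypothetical, not part of the answer above.

```python
import pandas as pd

def year_label(years: pd.Series) -> str:
    """Return 'min-max' for several years, or the single year as a string."""
    lo, hi = years.min(), years.max()
    return str(hi) if lo == hi else f'{lo}-{hi}'

df = pd.DataFrame({
    'Company_ID': [1, 1, 1, 1, 2, 2, 2, 3],
    'Year': [2010, 2011, 2012, 2013, 2012, 2013, 2014, 2011],
})

# flag the last year of each company and group on it together with the company
is_last = df.groupby('Company_ID')['Year'].transform('max').eq(df['Year']).rename('is_last')
labels = df.groupby(['Company_ID', is_last])['Year'].agg(year_label)
print(labels)
```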

Stef · Answer

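import pandas as pd
# rows to average: not bankrupt and not the company's last year
m = df.Bankrupt.eq(0) & df.groupby('Company_ID').Year.transform('max').ne(df.Year)
avg = df[m].groupby(['Company_ID','Bankrupt']).agg(Year=('Year', lambda x: f'{x.min()}-{x.max()}'),
    Metric_1=('Metric_1', 'mean'),
    Metric_2=('Metric_2', 'mean')).reset_index()
# DataFrame.append was removed in pandas 2.0, so the untouched rows are added back with pd.concat
pd.concat([avg, df[~m]]).sort_values('Company_ID', kind='stable')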

Result:

    Company_ID  Bankrupt       Year  Metric_1  Metric_2
0           1       0.0  2010-2012      20.0      30.0
3           1       1.0       2013      50.0       NaN
1           2       0.0  2012-2013      55.0      60.0
6           2       0.0       2014      10.0      10.0
7           3       1.0       2011     100.0     100.0
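
If there are more than two metric columns, the named aggregations do not have to be written out by hand. The sketch below builds them with a dict comprehension and unpacks them into agg. It assumes the metric columns can be recognised by a Metric prefix; metric_cols, avg and result are illustrative names, and the frame is rebuilt from the question's table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Company_ID': [1, 1, 1, 1, 2, 2, 2, 3],
    'Year': [2010, 2011, 2012, 2013, 2012, 2013, 2014, 2011],
    'Metric_1': [10, np.nan, 30, 50, 50, 60, 10, 100],
    'Metric_2': [20, 30, 40, np.nan, 60, np.nan, 10, 100],
    'Bankrupt': [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
})

# assumption: every column to average has a name starting with 'Metric'
metric_cols = [c for c in df.columns if c.startswith('Metric')]

# rows to average: not bankrupt and not the company's last year
m = df['Bankrupt'].eq(0) & df.groupby('Company_ID')['Year'].transform('max').ne(df['Year'])

avg = (df[m].groupby(['Company_ID', 'Bankrupt'])
            .agg(Year=('Year', lambda x: f'{x.min()}-{x.max()}'),
                 **{c: (c, 'mean') for c in metric_cols})  # one named aggregation per metric
            .reset_index())

# add the untouched rows back; a stable sort keeps each average ahead of the last year
result = pd.concat([avg, df[~m]]).sort_values('Company_ID', kind='stable')
print(result)
```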

Tags: python, pandas, dataframe

miloshdrago
