Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Use the highest value for duplicate IDs (Pandas DataFrame)

Tags:

python

pandas

I am calculating the total sum of 'price' for each 'id'. But when there are duplicates on 'loc_id' it should use the highest price for calculations and ignore the lower prices for the same 'loc_id'.

The example below shows 3 duplicates for A-1. The highest price for A-1 is 100 so the total sum for A should be 100 + 200

df

id     loc_id    price
A        A-1        50
A        A-1        100
A        A-1        30
A        A-2        200

B        B-1        30      
B        B-2        50      
df_expected_result

id      total       average
A        300        150
B        80         40

without the max() method, the code runs but then it double counts the loc_id duplicates. I am trying to avoid double counting. How can I make it select only the max value for each unique loc_id?

(df.set_index(['id','loc_id'])
  .groupby(level=0)['price']
  .max()
  .agg({'total' : np.sum , 'average' : np.average })                     
  .sort_values('total',ascending=False))

I get an error when I include max() method:

ValueError: cannot perform both aggregation and transformation operations simultaneously
like image 609
yagiz Avatar asked Apr 25 '19 14:04

yagiz


2 Answers

Try with a double groupby():

(df.groupby(['id','loc_id'],as_index=False)['price'].max()
           .groupby('id')['price'].agg(['sum','mean']))

    sum  mean
id           
A   300   150
B    80    40
like image 78
anky Avatar answered Nov 17 '22 17:11

anky


Here is one way using sort_values + drop_duplicates

df=df.sort_values(['price']).drop_duplicates(['id','loc_id'],keep='last')
df.groupby(['id']).price.agg(['mean','sum'])
Out[366]: 
    mean  sum
id           
A    150  300
B     40   80
like image 3
BENY Avatar answered Nov 17 '22 18:11

BENY