Aggregating using arbitrary precedence in pandas

Given the dataframe

  Column1 Column2  Column3
0       a     foo        1
1       a     bar        2
2       b     baz       12
3       b     foo        4
4       c     bar        6
5       c     foo        3
6       c     baz        7
7       d     foo        9

I'd like to group by Column1, using an arbitrary order of precedence over the Column2 values to decide which Column3 value to keep.

For example, if the order of precedence is:

  1. baz
  2. bar
  3. foo

then I would expect the output to show as

         Column3
Column1         
a              2
b             12
c              7
d              9

with the "a" group keeping the "bar" value because there is no "baz" for the "a" group, "b" group keeping the "baz" value, and so on.

What's the most elegant way to do this? Right now I'm chaining a series of apply lambdas to work through each item, but it feels sloppy.

EDIT: What if the precedence goes across multiple columns?

For example:

  Column1 Column2 Column3  Column4
0       a     foo    john        1
1       a     bar     jim        2
2       b     baz    jack       12
3       b     foo     jim        4
4       c     bar    john        6
5       c     foo    john        3
6       c     baz    jack        7
7       d     foo    jack        9

If the order of precedence across both Column2 and Column3 is:

  1. jim
  2. baz
  3. foo

then I would expect the output to show as

            Col  Column4
Column1                 
a           jim        2
b           jim        4
c           baz        7
d           foo        9

asked Jan 24 '23 17:01 by bcalc


1 Answer

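You can map each Column2 value to its rank in the precedence list, then use groupby + transform to keep the rows whose rank equals the best rank in their group:

# df is the DataFrame from the question
order = ['baz', 'bar', 'foo']               # highest precedence first
d = {v: i for i, v in enumerate(order)}     # value -> rank (0 is best)
out = df.assign(k=df['Column2'].map(d))     # helper column k holds each row's rank

# keep rows whose rank equals the lowest rank within their Column1 group
print(df[out['k'].eq(out.groupby("Column1")['k'].transform("min"))])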

  Column1 Column2  Column3
1       a     bar        2
2       b     baz       12
6       c     baz        7
7       d     foo        9
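
As an alternative sketch (not the approach above), the precedence can also be encoded as an ordered categorical so that sorting follows `order` instead of alphabetical order. This assumes the DataFrame from the question is rebuilt as `df`; the names `ranked` and `result` are only illustrative. Unlike the transform-based filter, it keeps exactly one row per group even when several rows tie on the best value.

```python
import pandas as pd

# DataFrame from the question
df = pd.DataFrame({
    "Column1": list("aabbcccd"),
    "Column2": ["foo", "bar", "baz", "foo", "bar", "foo", "baz", "foo"],
    "Column3": [1, 2, 12, 4, 6, 3, 7, 9],
})

order = ["baz", "bar", "foo"]  # highest precedence first

# Ordered categorical: sort order follows `order`; values not listed become NaN
ranked = df.assign(
    Column2=pd.Categorical(df["Column2"], categories=order, ordered=True)
)

# Sort by precedence (NaN last), then keep the first row seen for each Column1
result = (
    ranked.sort_values("Column2")
          .drop_duplicates("Column1")
          .sort_values("Column1")
          .set_index("Column1")[["Column3"]]
)
print(result)
```

With the sample data this should give the Column3 values 2, 12, 7 and 9 for groups a to d, matching the output requested in the question.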

EDIT: For multiple columns, the same logic can be applied like this:

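import pandas as pd

order = ['jim', 'baz', 'foo']               # highest precedence first
d = {v: i for i, v in enumerate(order)}     # value -> rank (0 is best)

# replace known values with their rank, coerce everything else to NaN,
# then take the best (lowest) rank in each row across both columns
s = df[['Column2', 'Column3']].replace(d).apply(pd.to_numeric, errors='coerce').min(axis=1)

# keep rows holding their group's best rank, then map the rank back to its value
out = (s[s.eq(s.groupby(df['Column1']).transform("min"))]
       .replace(dict(enumerate(order))).rename("Col"))

df.loc[out.index, ["Column1", "Column4"]].join(out)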

  Column1  Column4  Col
1       a        2  jim
3       b        4  jim
6       c        7  baz
7       d        9  foo
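
A self-contained variant of the multi-column case, offered as a sketch rather than a replacement: it reshapes the candidate columns into long form with `melt`, ranks every candidate value, and picks the best-ranked one per group with `idxmin`. The DataFrame is rebuilt from the question, and the names `rank`, `long_df`, `best` and `result` are assumptions made for illustration.

```python
import pandas as pd

# DataFrame from the question's EDIT
df = pd.DataFrame({
    "Column1": list("aabbcccd"),
    "Column2": ["foo", "bar", "baz", "foo", "bar", "foo", "baz", "foo"],
    "Column3": ["john", "jim", "jack", "jim", "john", "john", "jack", "jack"],
    "Column4": [1, 2, 12, 4, 6, 3, 7, 9],
})

order = ["jim", "baz", "foo"]                      # highest precedence first
rank = {value: i for i, value in enumerate(order)}

# Long form: one row per (original row, candidate column) pair
long_df = df.melt(id_vars=["Column1", "Column4"],
                  value_vars=["Column2", "Column3"],
                  value_name="Col")
long_df["rank"] = long_df["Col"].map(rank)         # values outside `order` become NaN

# Drop unranked candidates, then take the lowest rank within each group
best = long_df.dropna(subset=["rank"]).groupby("Column1")["rank"].idxmin()
result = long_df.loc[best].set_index("Column1")[["Col", "Column4"]]
print(result)
```

A group in which no value appears in `order` is simply absent from `result`; whether that is the desired behaviour depends on the use case.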

answered Jan 27 '23 06:01 by anky