I have a large (>10000 rows) dataframe like below, and I have to combine the rows with the same name by col_1, col_2, col_3 of group a and col_4, col_5 of group b or group c (b & c won't exist at the same time).
I could just find online how to combine (add up) every column of the rows, put not in different parts.
Original dataframe
| name | group | col_1 | col_2 | col_3 | col_4 | col_5 |
|---|---|---|---|---|---|---|
| ann | a | 1 | NaN | 2 | 3 | NaN |
| ann | b | 1 | NaN | NaN | 2 | 3 |
| ben | a | 0 | 1 | 2 | 1 | NaN |
| ben | c | 0 | NaN | NaN | 3 | NaN |
| cat | a | 2 | 3 | NaN | NaN | NaN |
Desired result
| name | col_1 | col_2 | col_3 | col_4 | col_5 |
|---|---|---|---|---|---|
| ann | 1 | NaN | 2 | 2 | 3 |
| ben | 0 | 1 | 2 | 3 | NaN |
| cat | 2 | 3 | NaN | NaN | NaN |
A bit of an ad-hoc approach, but one way you could do this is by sorting your dataframe by name and group, then split up your df into 2 parts (i.e. col1/2/3 for group a), perform a groupby and use first() on the first df and last() on the second, and then put back together using concat:
import pandas as pd
df.sort_values(['name','group'],ascending=True,inplace=True)
one = df.iloc[:,:5]
two = pd.concat([df.iloc[:,:2],df[['col_4','col_5']]],axis=1)
def my_func(df1,df2,col):
return pd.concat([df1.groupby(col).first(),
df2.groupby(col).last()],
axis=1).drop('group',axis=1)
res = my_func(one,two,'name')
Which prints:
print(res)
col_1 col_2 col_3 col_4 col_5
name
ann 1 NaN 2.00 2.00 3.00
ben 0 1.00 2.00 3.00 NaN
cat 2 3.00 NaN NaN NaN
If i'm not mistaken this gets you what you need.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With