I have a data frame like the following:
import numpy as np
import pandas as pd
test = pd.DataFrame({'ID':[4, 5, 6, 6, 6, 7, 7, 7], 'val1':['one', 'one', 'two', 'two', 'three', np.nan, 'seven', 'seven'], 'val2':['hi', 'bye', 'hola', 'hola', 'hola', 'ciao', 'ciao', 'namaste'], 'val3':[3, 3, 4, np.nan, 4, 5, 5, 6]})
test
   ID   val1     val2  val3
0   4    one       hi   3.0
1   5    one      bye   3.0
2   6    two     hola   4.0
3   6    two     hola   NaN
4   6  three     hola   4.0
5   7    NaN     ciao   5.0
6   7  seven     ciao   5.0
7   7  seven  namaste   6.0
Each ID has some measured values, and some IDs were measured in triplicate.
If the replicates of an ID disagree on a specific column, then I want the new data frame to have NaN for that value.
If a NaN is already present for one replicate (consider it not measured), but the other two replicates match, then I want that agreed value to appear in the final data frame. If the two values that are present disagree, then NaN.
I was thinking of using pandas groupby then aggregate for this, but I wasn't sure of how to do the logic for the aggregate function.
Essentially the output I am looking for is like:
pd.DataFrame({'ID':[4, 5, 6, 7], 'val1':['one', 'one', np.nan, 'seven'], 'val2':['hi', 'bye', 'hola', np.nan], 'val3':[3, 3, 4, np.nan]})
   ID   val1  val2  val3
0   4    one    hi   3.0
1   5    one   bye   3.0
2   6    NaN  hola   4.0
3   7  seven   NaN   NaN
Could you suggest how to do this?
Thanks!
Jack
Using groupby with agg and nunique:
test.groupby('ID',as_index=False).agg(lambda x : x.mode()[0] if x.nunique()==1 else np.nan)
Out[372]:
   ID   val1  val2  val3
0   4    one    hi   3.0
1   5    one   bye   3.0
2   6    NaN  hola   4.0
3   7  seven   NaN   NaN
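This works because of how you've defined your problem: nunique() ignores NaN by default, so a group with one missing value and otherwise matching values still counts as having a single unique value. mode() also drops NaN by default, so x.mode()[0] returns that agreed value.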
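To make the logic of that lambda easier to follow, here is a minimal sketch with it unrolled into a named function. The name agree_or_nan is a hypothetical helper chosen for illustration, not part of pandas, and the sketch assumes the same test frame as in the question.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'ID': [4, 5, 6, 6, 6, 7, 7, 7],
    'val1': ['one', 'one', 'two', 'two', 'three', np.nan, 'seven', 'seven'],
    'val2': ['hi', 'bye', 'hola', 'hola', 'hola', 'ciao', 'ciao', 'namaste'],
    'val3': [3, 3, 4, np.nan, 4, 5, 5, 6],
})

def agree_or_nan(s):
    """Hypothetical helper: the agreed value of one column in one group, else NaN."""
    observed = s.dropna().unique()  # distinct measured values; NaN counts as "not measured"
    # Exactly one distinct value means the replicates agree; zero or several means NaN.
    return observed[0] if len(observed) == 1 else np.nan

result = test.groupby('ID', as_index=False).agg(agree_or_nan)
print(result)
```

A group where every replicate is NaN has no observed values at all, so it also ends up as NaN under this sketch.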
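First, get the first non-null value of each column for each ID (first() skips NaN). Next, figure out which ID/column pairs have exactly one distinct value and mask everything else. Grouping with ID as the index keeps the ID column itself out of the mask.
g = test.groupby('ID')
v = g.first()
v[g.nunique().eq(1)].reset_index()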
   ID   val1  val2  val3
0   4    one    hi   3.0
1   5    one   bye   3.0
2   6    NaN  hola   4.0
3   7  seven   NaN   NaN
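As a sanity check, the masking idea can be compared against the frame the question asks for. This is only a sketch: it assumes the test frame from the question, uses where() for the masking step, and the variable names first_seen and single are made up for illustration.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'ID': [4, 5, 6, 6, 6, 7, 7, 7],
    'val1': ['one', 'one', 'two', 'two', 'three', np.nan, 'seven', 'seven'],
    'val2': ['hi', 'bye', 'hola', 'hola', 'hola', 'ciao', 'ciao', 'namaste'],
    'val3': [3, 3, 4, np.nan, 4, 5, 5, 6],
})

# The desired output, copied from the question.
expected = pd.DataFrame({
    'ID': [4, 5, 6, 7],
    'val1': ['one', 'one', np.nan, 'seven'],
    'val2': ['hi', 'bye', 'hola', np.nan],
    'val3': [3, 3, 4, np.nan],
})

g = test.groupby('ID')           # ID is the index here, so it is never masked
first_seen = g.first()           # first non-null value per column and ID
single = g.nunique().eq(1)       # True where the non-null values all agree
result = first_seen.where(single).reset_index()

# Raises AssertionError if the two frames differ (NaN positions are treated as equal).
pd.testing.assert_frame_equal(result, expected, check_dtype=False)
print(result)
```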