
Group by and aggregate columns but create NaN if values do not match

I have a data frame like the following:

import numpy as np
import pandas as pd

test = pd.DataFrame({'ID': [4, 5, 6, 6, 6, 7, 7, 7],
                     'val1': ['one', 'one', 'two', 'two', 'three', np.nan, 'seven', 'seven'],
                     'val2': ['hi', 'bye', 'hola', 'hola', 'hola', 'ciao', 'ciao', 'namaste'],
                     'val3': [3, 3, 4, np.nan, 4, 5, 5, 6]})

test
   ID   val1     val2  val3
0   4    one       hi   3.0
1   5    one      bye   3.0
2   6    two     hola   4.0
3   6    two     hola   NaN
4   6  three     hola   4.0
5   7    NaN     ciao   5.0
6   7  seven     ciao   5.0
7   7  seven  namaste   6.0

Each ID has some measured values, with some IDs being done in triplicate.

If there is any disagreement between the replicates for an ID in a specific column, then I want the new data frame to have a NaN for that value.

If a NaN is already present for one value (treat it as not measured) but the other two replicates for that sample match, then I want that agreed value in the final data frame. If the two measured values disagree, then NaN.

I was thinking of using pandas groupby then aggregate for this, but I wasn't sure of how to do the logic for the aggregate function.

Essentially the output I am looking for is like:

pd.DataFrame({'ID': [4, 5, 6, 7],
              'val1': ['one', 'one', np.nan, 'seven'],
              'val2': ['hi', 'bye', 'hola', np.nan],
              'val3': [3, 3, 4, np.nan]})

   ID   val1  val2  val3
0   4    one    hi   3.0
1   5    one   bye   3.0
2   6    NaN  hola   4.0
3   7  seven   NaN   NaN

Could you suggest how to do this?

Thanks!

Jack

asked Aug 20 '18 by Jack Arnestad

2 Answers

Use groupby with a custom aggregate: when all non-NaN values in a group agree (nunique() == 1), take the mode; otherwise return NaN.

test.groupby('ID', as_index=False).agg(lambda x: x.mode()[0] if x.nunique() == 1 else np.nan)
Out[372]: 
   ID   val1  val2  val3
0   4    one    hi   3.0
1   5    one   bye   3.0
2   6    NaN  hola   4.0
3   7  seven   NaN   NaN
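A minimal alternative sketch of the same idea, spelling out the "all non-NaN values agree" check with dropna/unique instead of mode/nunique (the helper name agree is my own, not from the answer):

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'ID': [4, 5, 6, 6, 6, 7, 7, 7],
                     'val1': ['one', 'one', 'two', 'two', 'three', np.nan, 'seven', 'seven'],
                     'val2': ['hi', 'bye', 'hola', 'hola', 'hola', 'ciao', 'ciao', 'namaste'],
                     'val3': [3, 3, 4, np.nan, 4, 5, 5, 6]})

def agree(s):
    """Return the group's single agreed value (NaN ignored); NaN on disagreement."""
    vals = s.dropna().unique()
    return vals[0] if len(vals) == 1 else np.nan

out = test.groupby('ID', as_index=False).agg(agree)
```

This reads the requirement directly: drop the unmeasured NaNs first, then demand exactly one distinct value per group.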
answered Oct 07 '22 by BENY

This works because of how you've defined your problem.

First, take the first non-null value in each column for each ID. Next, check which columns have exactly one distinct value per group and mask everything else with NaN.

g = test.groupby('ID')
v = g.first()                       # first non-null value per column within each group
v.where(g.nunique().eq(1)).reset_index()

   ID   val1  val2  val3
0   4    one    hi   3.0
1   5    one   bye   3.0
2   6    NaN  hola   4.0
3   7  seven   NaN   NaN
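The same first/nunique idea can also be written with a per-row mask via transform, which sidesteps any index alignment between two grouped frames (a sketch using the test frame from the question):

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'ID': [4, 5, 6, 6, 6, 7, 7, 7],
                     'val1': ['one', 'one', 'two', 'two', 'three', np.nan, 'seven', 'seven'],
                     'val2': ['hi', 'bye', 'hola', 'hola', 'hola', 'ciao', 'ciao', 'namaste'],
                     'val3': [3, 3, 4, np.nan, 4, 5, 5, 6]})

vals = test.drop(columns='ID')

# broadcast each group's distinct-value count (NaN excluded) back onto the rows
agreeing = vals.groupby(test['ID']).transform('nunique').eq(1)

# NaN-out the cells of disagreeing columns, then take the first non-null per group
masked = vals.where(agreeing)
masked.insert(0, 'ID', test['ID'])
out = masked.groupby('ID', as_index=False).first()
```

Because first() skips nulls, a group like ID 6's val3 (4, NaN, 4) still collapses to 4.0.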
answered Oct 07 '22 by cs95