Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Label columns based on grouped data

Tags:

python

pandas

I am trying to create a column that consists of unique values for each id (each id has numerous rows associated with it), if the id has the tag answered associated with any of its rows, all the rows associated with that id should be labelled answered. If all the rows associated with an id have an unanswered tag, all the rows should be labelled as unanswered (this is currently what happens)

This is code I have written:

import numpy as np

conds = [file.data__answered_at.isna(),file.data__answered_at.notna()]
choices = ["not answered","answered"]
file['call_status'] = np.select(conds,choices,default=np.nan)

 data__id   call_status       rank
  1            answered        1
  1          not_answered      2
  1            answered        3
  2          not_answered      1
  2             answered       2
  3          not_answered      1
  4            answered        1
  4          not_answered      2
  5          not_answered      1
  5          not_answered      2

In this case the desired outcome would be

   data__id   call_status       rank
  1            answered        1
  1            answered        2
  1            answered        3
  2            answered        1
  2            answered        2
  3          not_answered      1
  4            answered        1
  4            answered        2
  5          not_answered      1
  5          not_answered      2
like image 587
Emm Avatar asked Jan 26 '23 01:01

Emm


2 Answers

Use GroupBy.transform with GroupBy.any for test at least one answered per groups and set values by DataFrame.loc:

mask = df['call_status'].eq('answered').groupby(df['data__id']).transform('any')

Or get all data__id filtered by another column and test membership by Series.isin:

mask = df['data__id'].isin(df.loc[df['call_status'].eq('answered'), 'data__id'].unique())

df.loc[mask, 'call_status'] = 'answered'
print (df)
   data__id   call_status  rank
0         1      answered     1
1         1      answered     2
2         1      answered     3
3         2      answered     1
4         2      answered     2
5         3  not_answered     1
6         4      answered     1
7         4      answered     2
8         5  not_answered     1
9         5  not_answered     2
like image 181
jezrael Avatar answered Jan 28 '23 15:01

jezrael


We can use groupby here and check if any of the rows are equal to answered.

Then we use np.where to conditionally fill in answered or not_answered

m = file.groupby('data__id')['call_status'].transform(lambda x: x.eq('answered').any())

file['call_status'] = np.where(m, 'answered', 'not_answered')

Output

  data__id   call_status  rank
0         1      answered     1
1         1      answered     2
2         1      answered     3
3         2      answered     1
4         2      answered     2
5         3  not_answered     1
6         4      answered     1
7         4      answered     2
8         5  not_answered     1
9         5  not_answered     2
like image 27
Erfan Avatar answered Jan 28 '23 15:01

Erfan