How to use .loc with groupby and two conditions in pandas

Tags: python, pandas

I asked a similar question here, but I want to expand on it because I've been asked to do something slightly different, and I cannot use .duplicated().

I have a df that's grouped by 'Key'. I want to flag the rows within a group where a discharge date matches an admit date AND, of those rows, the row with the discharge date has a Num1 value in the range 5-12 (inclusive).

import pandas as pd

df = pd.DataFrame({
    'Key': ['10003', '10003', '10003', '10003', '10003', '10003', '10034', '10034'],
    'Num1': [12, 13, 13, 13, 12, 13, 15, 12],
    'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
    'admit': [20120506, 20120508, 20121010, 20121010, 20121010, 20121110, 20120520, 20120520],
    'discharge': [20120508, 20120510, 20121012, 20121016, 20121023, 20121111, 20120520, 20120520],
})
df['admit'] = pd.to_datetime(df['admit'], format='%Y%m%d')
df['discharge'] = pd.to_datetime(df['discharge'], format='%Y%m%d')

initial df

    Key     Num1    Num2    admit       discharge
0   10003   12      121     2012-05-06  2012-05-08
1   10003   13      122     2012-05-08  2012-05-10
2   10003   13      122     2012-10-10  2012-10-12
3   10003   13      124     2012-10-10  2012-10-16
4   10003   12      125     2012-10-10  2012-10-23
5   10003   13      126     2012-11-10  2012-11-11
6   10034   15      127     2012-05-20  2012-05-20
7   10034   12      128     2012-05-20  2012-05-20

final df

    Key     Num1    Num2    admit       discharge   flag
0   10003   12      121     2012-05-06  2012-05-08  1
1   10003   13      122     2012-05-08  2012-05-10  1
2   10003   13      122     2012-10-10  2012-10-12  0
3   10003   13      124     2012-10-10  2012-10-16  0
4   10003   12      125     2012-10-10  2012-10-23  0
5   10003   13      126     2012-11-10  2012-11-11  0
6   10034   15      127     2012-05-20  2012-05-20  1
7   10034   12      128     2012-05-20  2012-05-20  1

I was trying to use filter(), but I can't quite figure out how to apply any() to the discharge date. My logic was to pick the first admit date in a group, check that date against each discharge date in the group, and, once there is a match, check whether the row with the matching discharge date has a Num1 value in the range 5-12.

num1_range = [5,6,7,8,9,10,11,12]
df.loc[df.groupby(['Key']).filter(lambda x : (x['admit'] == x['discharge'].any())&(x['Num1'].isin(num1_range).any())),'flag']=1

I'm getting an error

ValueError: cannot set a Timestamp with a non-timestamp
Martin Bobak asked Mar 08 '18 03:03


1 Answer

I believe you want flag = 1 when either of these 2 conditions is satisfied:

  1. Admit date is equal to any discharge date within the group (Key).
  2. Discharge date is equal to any admit date within the group, provided Num1 is in the range 5 to 12 inclusive.

The below logic produces the result in line with your desired output.

Solution

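# Map each Key to the set of admit dates and the set of discharge dates in its group.
d1 = df.groupby('Key')['admit'].apply(set).to_dict()
d2 = df.groupby('Key')['discharge'].apply(set).to_dict()

def flagger(row):
    # match1: this row's discharge date is an admit date in the same group.
    # match2: this row's admit date is a discharge date in the same group.
    match1 = row['discharge'] in d1[row['Key']]
    match2 = row['admit'] in d2[row['Key']]
    return match2 or (match1 and (row['Num1'] in range(5, 13)))

# Apply the rule row by row and store the result as 0/1.
df['flag'] = df.apply(flagger, axis=1).astype(int)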

Result

     Key  Num1  Num2      admit  discharge  flag
0  10003    12   121 2012-05-06 2012-05-08     1
1  10003    13   122 2012-05-08 2012-05-10     1
2  10003    13   122 2012-10-10 2012-10-12     0
3  10003    13   124 2012-10-10 2012-10-16     0
4  10003    12   125 2012-10-10 2012-10-23     0
5  10003    13   126 2012-11-10 2012-11-11     0
6  10034    15   127 2012-05-20 2012-05-20     1
7  10034    12   128 2012-05-20 2012-05-20     1
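
A side note on the filter() attempt in the question: groupby(...).filter(...) expects the function to return a single True/False per group, and it returns the rows of the groups that pass, not a row-level mask that .loc can use. The sketch below contrasts it with transform, which broadcasts a group-level result back to every row. The toy frame and the names kept and mask are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Toy data, hypothetical and only for illustration.
df = pd.DataFrame({
    'Key': ['a', 'a', 'b', 'b'],
    'Num1': [12, 13, 15, 20],
})

# filter: the function returns one bool per group; whole groups are kept or dropped.
kept = df.groupby('Key').filter(lambda g: g['Num1'].between(5, 12).any())

# transform: the group-level "any" is broadcast to each row, giving a row-aligned mask.
in_range = df['Num1'].between(5, 12)
mask = in_range.groupby(df['Key']).transform('any').astype(bool)

df['flag'] = 0
df.loc[mask, 'flag'] = 1
```

Here kept contains only the rows of group 'a', while mask is a boolean Series aligned with df, which is the kind of object .loc expects for row selection.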

Explanation

  • Create 2 dictionaries, mapping Key -> set of admit dates and Key -> set of discharge dates respectively.
  • Use these 2 dictionaries to apply the criteria row by row using pd.DataFrame.apply.
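
If the row-wise apply becomes slow on a larger frame, the same two conditions can probably be checked without it. The following is a sketch rather than part of the solution above: it compares (Key, date) pairs through MultiIndex.isin, assumes pandas 0.24 or later for MultiIndex.from_frame, and uses illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['10003', '10003', '10003', '10003', '10003', '10003', '10034', '10034'],
    'Num1': [12, 13, 13, 13, 12, 13, 15, 12],
    'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
    'admit': [20120506, 20120508, 20121010, 20121010, 20121010, 20121110, 20120520, 20120520],
    'discharge': [20120508, 20120510, 20121012, 20121016, 20121023, 20121111, 20120520, 20120520],
})
df['admit'] = pd.to_datetime(df['admit'], format='%Y%m%d')
df['discharge'] = pd.to_datetime(df['discharge'], format='%Y%m%d')

# One (Key, date) pair per row, for each of the two date columns.
admit_pairs = pd.MultiIndex.from_frame(df[['Key', 'admit']])
discharge_pairs = pd.MultiIndex.from_frame(df[['Key', 'discharge']])

# Condition 1: the row's admit date is a discharge date in the same group.
admit_is_a_discharge = admit_pairs.isin(discharge_pairs)
# Condition 2: the row's discharge date is an admit date in the same group.
discharge_is_an_admit = discharge_pairs.isin(admit_pairs)

# Num1 in the range 5 to 12 inclusive, as in the question.
in_range = df['Num1'].between(5, 12)

df['flag'] = ((in_range & discharge_is_an_admit) | admit_is_a_discharge).astype(int)
```

On the sample data this gives the same flag column as the result shown above. Pairing Key with the date keeps the comparison inside each group without an explicit groupby.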
jpp answered Sep 17 '22 15:09