I asked a similar question here, but I want to expand on it because I've been asked to do something slightly different, where I cannot use .duplicates()
I have a df that's grouped by 'Key'. I want to flag any row within a group where the discharge date matches the admit date AND, between those rows, the row with the discharge date has a Num1 value in the range of 5-12.
import pandas as pd

df = pd.DataFrame({'Key': ['10003', '10003', '10003', '10003', '10003', '10003', '10034', '10034'],
                   'Num1': [12, 13, 13, 13, 12, 13, 15, 12],
                   'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
                   'admit': [20120506, 20120508, 20121010, 20121010, 20121010, 20121110, 20120520, 20120520],
                   'discharge': [20120508, 20120510, 20121012, 20121016, 20121023, 20121111, 20120520, 20120520]})
# Parse the YYYYMMDD integers into datetimes
df['admit'] = pd.to_datetime(df['admit'], format='%Y%m%d')
df['discharge'] = pd.to_datetime(df['discharge'], format='%Y%m%d')
initial df
Key Num1 Num2 admit discharge
0 10003 12 121 2012-05-06 2012-05-08
1 10003 13 122 2012-05-08 2012-05-10
2 10003 13 122 2012-10-10 2012-10-12
3 10003 13 124 2012-10-10 2012-10-16
4 10003 12 125 2012-10-10 2012-10-23
5 10003 13 126 2012-11-10 2012-11-11
6 10034 15 127 2012-05-20 2012-05-20
7 10034 12 128 2012-05-20 2012-05-20
final df
Key Num1 Num2 admit discharge flag
0 10003 12 121 2012-05-06 2012-05-08 1
1 10003 13 122 2012-05-08 2012-05-10 1
2 10003 13 122 2012-10-10 2012-10-12 0
3 10003 13 124 2012-10-10 2012-10-16 0
4 10003 12 125 2012-10-10 2012-10-23 0
5 10003 13 126 2012-11-10 2012-11-11 0
6 10034 15 127 2012-05-20 2012-05-20 1
7 10034 12 128 2012-05-20 2012-05-20 1
I was trying to use filter() but I can't quite figure out how to apply any() to the discharge date. My logic was to pick the first admit date in a group, check that date against each discharge date, and, once there is a match, check whether the row with that discharge date has a Num1 value in the range of 5-12.
num1_range = [5,6,7,8,9,10,11,12]
df.loc[df.groupby(['Key']).filter(lambda x : (x['admit'] == x['discharge'].any())&(x['Num1'].isin(num1_range).any())),'flag']=1
I'm getting an error
ValueError: cannot set a Timestamp with a non-timestamp
With loc[] you can apply multiple conditions. Make sure you surround each condition with brackets; omitting them will give you incorrect results.
Using loc to Filter With Multiple Conditions
The loc indexer in pandas can be used to access groups of rows or columns by label. Add each condition you want included in the filtered result and combine them with the & operator. The result is a pd.DataFrame of the filtered rows.
groupby() can take a list of columns to group by multiple columns, and aggregate functions can then apply one or several aggregations at the same time.
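As a rough illustration of grouping by a list of columns and computing several aggregations in one call, here is a minimal sketch. The small frame and the output column names `rows` and `num2_max` are made up for the example and are not part of the question.

```python
import pandas as pd

# Small illustrative frame (a subset of the question's columns).
df = pd.DataFrame({
    'Key': ['10003', '10003', '10003', '10034'],
    'Num1': [12, 13, 13, 15],
    'Num2': [121, 122, 124, 127],
})

# Group by two columns and compute two aggregations of Num2 at once.
# 'rows' and 'num2_max' are hypothetical output column names.
summary = df.groupby(['Key', 'Num1']).agg(
    rows=('Num2', 'size'),
    num2_max=('Num2', 'max'),
)
```

The result is indexed by the (Key, Num1) pairs, so `summary.loc[('10003', 13)]` returns the aggregates for that pair.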
You can get a pandas.Series of bool that is the AND of two conditions by using &. Note that == and ~ are used here as the second condition for the sake of explanation, but you can use != instead.
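The points about loc, & and ~ can be sketched roughly as follows. The frame is a cut-down version of the question's data, and the variable names (`in_range`, `is_10034`, and so on) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['10003', '10003', '10034', '10034'],
    'Num1': [12, 13, 15, 12],
})

# Each condition is a boolean Series aligned with df.
in_range = df['Num1'].between(5, 12)   # inclusive on both ends by default
is_10034 = df['Key'] == '10034'

# Written inline, each comparison needs its own parentheses,
# because & binds more tightly than == in Python.
both = df.loc[(df['Key'] == '10034') & (df['Num1'].between(5, 12))]

# ~ negates a mask; ~is_10034 selects the same rows as df['Key'] != '10034'.
in_range_other_keys = df.loc[in_range & ~is_10034]
```

Here `both` keeps only the last row and `in_range_other_keys` keeps only the first.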
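I believe you are looking for either of 2 conditions to be satisfied for flag = True: the admit date matches any discharge date within the group (Key), or the discharge date matches any admit date within the group (Key) and Num1 is in the range 5 to 12 inclusive. The logic below produces a result in line with your desired output.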
Solution
d1 = df.groupby('Key')['admit'].apply(set).to_dict()
d2 = df.groupby('Key')['discharge'].apply(set).to_dict()
def flagger(row):
match1, match2 = row['discharge'] in d1[row['Key']], row['admit'] in d2[row['Key']]
return match2 or (match1 and (row['Num1'] in range(5, 13)))
df['flag'] = df.apply(flagger, axis=1).astype(int)
Result
Key Num1 Num2 admit discharge flag
0 10003 12 121 2012-05-06 2012-05-08 1
1 10003 13 122 2012-05-08 2012-05-10 1
2 10003 13 122 2012-10-10 2012-10-12 0
3 10003 13 124 2012-10-10 2012-10-16 0
4 10003 12 125 2012-10-10 2012-10-23 0
5 10003 13 126 2012-11-10 2012-11-11 0
6 10034 15 127 2012-05-20 2012-05-20 1
7 10034 12 128 2012-05-20 2012-05-20 1
Explanation
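d1 and d2 map each Key to the set of its admit dates and discharge dates respectively. flagger checks a single row against those sets, and pd.DataFrame.apply with axis=1 calls it on every row.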
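If the row-wise apply becomes slow on a larger frame, one possible variant is to test (Key, date) pairs against sets and combine boolean Series. This is only a sketch under the assumption that the two conditions described above are the intended ones; the names `admit_pairs`, `discharge_pairs`, `admit_is_discharge` and `discharge_is_admit` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['10003', '10003', '10003', '10003', '10003', '10003', '10034', '10034'],
    'Num1': [12, 13, 13, 13, 12, 13, 15, 12],
    'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
    'admit': [20120506, 20120508, 20121010, 20121010, 20121010, 20121110, 20120520, 20120520],
    'discharge': [20120508, 20120510, 20121012, 20121016, 20121023, 20121111, 20120520, 20120520],
})
df['admit'] = pd.to_datetime(df['admit'], format='%Y%m%d')
df['discharge'] = pd.to_datetime(df['discharge'], format='%Y%m%d')

# All (Key, date) pairs seen in each column; membership tests stay within a Key.
admit_pairs = set(zip(df['Key'], df['admit']))
discharge_pairs = set(zip(df['Key'], df['discharge']))

# Condition 1: this row's admit date is some discharge date in the same Key.
admit_is_discharge = pd.Series(
    [pair in discharge_pairs for pair in zip(df['Key'], df['admit'])],
    index=df.index,
)
# Condition 2 (first half): this row's discharge date is some admit date in the same Key.
discharge_is_admit = pd.Series(
    [pair in admit_pairs for pair in zip(df['Key'], df['discharge'])],
    index=df.index,
)

in_range = df['Num1'].between(5, 12)  # 5 to 12 inclusive

df['flag'] = (admit_is_discharge | (discharge_is_admit & in_range)).astype(int)
```

On the sample data this gives the same flag column as the apply-based solution.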