Np.where has been giving me a lot of errors, so I am looking for a solution with df.loc instead.
This is the np.where error I have been getting:
C:\Users\xxx\AppData\Local\Continuum\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if __name__ == '__main__':
I am working with the following dataframe df:
df = pd.DataFrame({'Column_A': ['AAA','AAA','ABC','CDE'],'checked': ['0','0','1','0'],'duplicate': ['True','True','False','False']})
Column_A checked duplicate
0 AAA 0 True
1 AAA 0 True
2 ABC 1 False
3 CDE 0 False
I want to create an additional flag, if checked is 0 and duplicate is True.
I tried this and it didn't work:
df['flag'] = (np.where((df['checked'] == 'Y') &(df['duplicate'] == 'True'), 'Y', '0'))
TypeError: invalid type comparison
I tried it with df.loc:
df['flag'] = (df.loc[df['checked'] == 'Y']& df.loc[df['duplicate'] == 'True'], 'Y','0')
TypeError: invalid type comparison
and I get the same error!
where is usually faster because working with NumPy directly avoids some pandas overheads. OTOH, using loc is considered the pandaic way of doing things.
The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.
I think your boolean
are not string
s, so need remove '
:
df = pd.DataFrame({'Column_A': ['AAA','AAA','ABC','CDE'],
'checked': ['0','0','1','0'],
'duplicate': [True, True, False, False]})
df['flag'] = np.where((df['checked'] == 'Y') &(df['duplicate'] == True), 'Y', '0')
print (df)
Column_A checked duplicate flag
0 AAA 0 True 0
1 AAA 0 True 0
2 ABC 1 False 0
3 CDE 0 False 0
Or if compare with boolean
column, == True
can be omited:
df['flag'] = np.where((df['checked'] == 'Y') &(df['duplicate']), 'Y', '0')
print (df)
Column_A checked duplicate flag
0 AAA 0 True 0
1 AAA 0 True 0
2 ABC 1 False 0
3 CDE 0 False 0
Also if need check checked
need '
because strings
:
df['flag'] = np.where((df['checked'] == '0') &(df['duplicate'] == True), 'Y', '0')
print (df)
Column_A checked duplicate flag
0 AAA 0 True Y
1 AAA 0 True Y
2 ABC 1 False 0
3 CDE 0 False 0
EDIT:
Solution with loc
:
df['flag'] = '0'
mask = (df['checked'] == '0') &(df['duplicate'])
df.loc[mask, 'flag'] = 'Y'
print (df)
Column_A checked duplicate flag
0 AAA 0 True Y
1 AAA 0 True Y
2 ABC 1 False 0
3 CDE 0 False 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With