Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python: pandas np.where vs. df.loc with multiple conditions

Np.where has been giving me a lot of errors, so I am looking for a solution with df.loc instead.

This is the np.where error I have been getting:

C:\Users\xxx\AppData\Local\Continuum\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

I am working with the following dataframe df:

df = pd.DataFrame({'Column_A': ['AAA','AAA','ABC','CDE'],'checked': ['0','0','1','0'],'duplicate': ['True','True','False','False']})

    Column_A    checked   duplicate
0   AAA             0      True
1   AAA             0      True
2   ABC             1      False
3   CDE             0      False

I want to create an additional flag, if checked is 0 and duplicate is True.

I tried this and it didn't work:

df['flag'] = (np.where((df['checked'] == 'Y') &(df['duplicate'] == 'True'), 'Y', '0'))

TypeError: invalid type comparison

I tried it with df.loc:

df['flag'] = (df.loc[df['checked'] == 'Y']& df.loc[df['duplicate'] == 'True'], 'Y','0')

TypeError: invalid type comparison

and I get the same error!

like image 854
jeangelj Avatar asked Jun 15 '17 14:06

jeangelj


People also ask

Which is faster NP where or loc?

where is usually faster because working with NumPy directly avoids some pandas overheads. OTOH, using loc is considered the pandaic way of doing things.

Is pandas query faster than loc?

The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.


1 Answers

I think your boolean are not strings, so need remove ':

df = pd.DataFrame({'Column_A': ['AAA','AAA','ABC','CDE'],
                  'checked': ['0','0','1','0'],
                  'duplicate': [True, True, False, False]})

df['flag'] = np.where((df['checked'] == 'Y') &(df['duplicate'] == True), 'Y', '0')
print (df)
  Column_A checked  duplicate flag
0      AAA       0       True    0
1      AAA       0       True    0
2      ABC       1      False    0
3      CDE       0      False    0

Or if compare with boolean column, == True can be omited:

df['flag'] = np.where((df['checked'] == 'Y') &(df['duplicate']), 'Y', '0')
print (df)
  Column_A checked  duplicate flag
0      AAA       0       True    0
1      AAA       0       True    0
2      ABC       1      False    0
3      CDE       0      False    0

Also if need check checked need ' because strings:

df['flag'] = np.where((df['checked'] == '0') &(df['duplicate'] == True), 'Y', '0')
print (df)
  Column_A checked  duplicate flag
0      AAA       0       True    Y
1      AAA       0       True    Y
2      ABC       1      False    0
3      CDE       0      False    0

EDIT:

Solution with loc:

df['flag'] = '0'
mask = (df['checked'] == '0') &(df['duplicate'])
df.loc[mask, 'flag'] = 'Y'
print (df)
  Column_A checked  duplicate flag
0      AAA       0       True    Y
1      AAA       0       True    Y
2      ABC       1      False    0
3      CDE       0      False    0
like image 120
jezrael Avatar answered Sep 28 '22 18:09

jezrael