Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does Pandas Dataframe.where method return NaN after calling dropna()?

Tags:

python

pandas

import pandas as pd

df = pd.read_csv('file.csv')
df.dropna(inplace=True)

filter1 = df['col1'] == 'some_value'
filter2 = df['col2'] == 'some_other_value'

df.where(filter1 & filter2, inplace=True)

df.head()

     localCountry localState remoteCountry remoteState  ... col1 col2 col3 num_samples
1250          NaN        NaN           NaN         NaN  ...           NaN           NaN            NaN         NaN
1251          NaN        NaN           NaN         NaN  ...           NaN           NaN            NaN         NaN

I thought dropna() would remove all rows that contained at least one NaN. Why is there NaN in the result here? A misunderstanding on my part for sure, but I haven't been able to logic out why this is the case after using dropna prior to the where method.

Edit for others:

The where() method replaces the values where the condition passed is false. If you don't supply a default, it will replace them with NaN. It is not just a query to return rows where both conditions are met.

Dataframe.where

Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

like image 755
Kris Avatar asked Sep 16 '25 04:09

Kris


1 Answers

I think the problem is you are missing the default value of DataFrame.where for setting rows not matched by any of the conditions (other):

df = pd.DataFrame({'col1':['some_value','some_value','aaa','dd'],
                   'col2':['some_other_value','dd','some_other_value', 'bb'],
                   'col3':list('abcd')})
print (df)
         col1              col2  col3
0  some_value  some_other_value     0
1  some_value                dd     1
2         aaa  some_other_value     2
3          dd                bb     3

filter1 = df['col1'] == 'some_value'
filter2 = df['col2'] == 'some_other_value'
df.where(filter1 & filter2, other=np.nan, inplace=True)
print (df)
0  some_value  some_other_value    a
1         NaN               NaN  NaN
2         NaN               NaN  NaN
3         NaN               NaN  NaN

If you change the replacement value:

df.where(filter1 & filter2, other='val', inplace=True)
         col1              col2 col3
0  some_value  some_other_value    a
1         val               val  val
2         val               val  val
3         val               val  val

If you want to filter the rows, then use boolean indexing:

df1 = df[filter1 & filter2]
like image 147
jezrael Avatar answered Sep 19 '25 08:09

jezrael