I've just found out about this strange behaviour of mask, could someone explain this to me?
A) [input]
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] = 'hi'
df.mask(df[['A', 'B']] < 3, inplace=True)
[output]
| | A | B | C |
|---|---|---|---|
| 0 | NaN | NaN | hi |
| 1 | NaN | 3.0 | hi |
| 2 | 4.0 | 5.0 | hi |
| 3 | 6.0 | 7.0 | hi |
| 4 | 8.0 | 9.0 | hi |
B) [input]
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] = 'hi'
df.mask(df[['A', 'B']] < 3)
[output]
| | A | B | C |
|---|---|---|---|
| 0 | NaN | NaN | NaN |
| 1 | NaN | 3.0 | NaN |
| 2 | 4.0 | 5.0 | NaN |
| 3 | 6.0 | 7.0 | NaN |
| 4 | 8.0 | 9.0 | NaN |
Thank you in advance
The root cause of the different results is that you pass a boolean DataFrame whose shape differs from that of the DataFrame you want to mask. df.mask() fills the missing part of the condition with the value of inplace.
From the source code, you can see that pandas.DataFrame.mask() calls pandas.DataFrame.where() internally, passing the inverted condition. pandas.DataFrame.where() then calls a _where() method that replaces values where the condition is False.
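To make the relationship between the two methods concrete, here is a minimal sketch. It assumes a condition with the same shape as df (the name cond and the threshold are only illustrative), so that no alignment gaps come into play:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])

# A condition with the same shape as df, so nothing has to be filled in.
cond = df < 3

# mask() replaces values where cond is True ...
masked = df.mask(cond)
# ... which is the same as where() keeping values where ~cond is True.
where_inverted = df.where(~cond)

print(masked.equals(where_inverted))
```

With a full-shape condition the two calls should agree, which is why the rest of the explanation can look at where() only.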
I'll use df.where() as the example. Here is the code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])
df1 = df.where(df[['A', 'B']]<3)
df.where(df[['A', 'B']]<3, inplace=True)
In this example, df is
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
df[['A', 'B']]<3, the value of the cond argument, is
A B
0 True True
1 False False
2 False False
3 False False
Digging into the _where() method, the following lines are the key part:
def _where(...):
    # align the cond to same shape as myself
    cond = com.apply_if_callable(cond, self)
    if isinstance(cond, NDFrame):
        cond, _ = cond.align(self, join="right", broadcast_axis=1)
    ...
    # make sure we are boolean
    fill_value = bool(inplace)
    cond = cond.fillna(fill_value)
Since cond and df have different shapes, cond.align() fills the missing entries with NaN. After that, cond looks like
A B C
0 True True NaN
1 False False NaN
2 False False NaN
3 False False NaN
Then, with cond.fillna(fill_value), the NaN values are replaced with the value of inplace. So the C column of cond takes the same value as inplace.
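The two steps can be reproduced outside of pandas internals. The following is only a sketch of what the excerpt does, not the library's actual code path: it calls the public DataFrame.align() and fillna() directly, and the names aligned, filled_default and filled_inplace are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])
cond = df[['A', 'B']] < 3

# Align cond to df's labels. Column C does not exist in cond,
# so it is created and filled with NaN.
aligned, _ = cond.align(df, join='right')

# Fill the gap with bool(inplace), once for each possible value of inplace.
filled_default = aligned.fillna(False).astype(bool)  # inplace=False
filled_inplace = aligned.fillna(True).astype(bool)   # inplace=True

print(aligned)
print(filled_default)
print(filled_inplace)
```

The only difference between the two filled conditions is column C, which is exactly the column that behaves differently in the question.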
There is some other code related to inplace (L9048 and L9124-L9145), but we needn't care about the details, since the aim of those lines is to replace values where the condition is False.
Recall that df is
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
df1 = df.where(df[['A', 'B']]<3): The C column of cond is False, since the default value of inplace is False. After df.where(), the C column of df1 is set to the value of the other argument, which is NaN by default.
df.where(df[['A', 'B']]<3, inplace=True): The C column of cond is True. After df.where(), the C column of df stays the same.
# print(df1)
A B C
0 0.0 1.0 NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
# print(df) after df.where(df[['A', 'B']]<3, inplace=True)
A B C
0 0.0 1.0 2
1 NaN NaN 5
2 NaN NaN 8
3 NaN NaN 11
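If you want a result that does not depend on how the gap is filled, one option is to give mask() a condition that already covers every column. This is a sketch under the assumption that columns absent from the condition should be left untouched; full_cond is a hypothetical name, and reindex() with fill_value=False is just one way to build it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] = 'hi'

# Expand the condition to all of df's columns. Columns that were missing
# from it (here C) become False, i.e. "never mask this column".
full_cond = (
    (df[['A', 'B']] < 3)
    .reindex(columns=df.columns, fill_value=False)
    .astype(bool)
)

# The condition now has the same shape as df, so no NaN needs to be filled.
result = df.mask(full_cond)
print(result)
```

Here column C keeps its 'hi' values while A and B are masked where they are below 3, matching output A from the question without relying on inplace.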