I've just come across this strange behaviour of mask. Could someone explain it to me?
A) [input]
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] = 'hi'
df.mask(df[['A', 'B']] < 3, inplace=True)
[output]
| | A | B | C |
|---|---|---|---|
| 0 | NaN | NaN | hi |
| 1 | NaN | 3.0 | hi |
| 2 | 4.0 | 5.0 | hi |
| 3 | 6.0 | 7.0 | hi |
| 4 | 8.0 | 9.0 | hi |
B) [input]
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] = 'hi'
df.mask(df[['A', 'B']] < 3)
[output]
| | A | B | C |
|---|---|---|---|
| 0 | NaN | NaN | NaN |
| 1 | NaN | 3.0 | NaN |
| 2 | 4.0 | 5.0 | NaN |
| 3 | 6.0 | 7.0 | NaN |
| 4 | 8.0 | 9.0 | NaN |
Thank you in advance
The results differ because the boolean DataFrame you pass as the condition does not have the same shape as the DataFrame you want to mask. df.mask() fills the missing part of the condition with the value of inplace.
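One way to sidestep the surprise, as a minimal sketch: expand the condition to cover every column before calling mask(), so nothing is left for pandas to fill in. The name full_cond is made up for illustration, and filling the missing column with False ("do not mask") is an assumption about what the asker wants.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] = 'hi'

# Hypothetical name: expand the partial condition to every column of df.
# Columns missing from the condition get False, i.e. "do not mask".
# astype(bool) makes sure every column of the condition is boolean.
full_cond = (
    (df[['A', 'B']] < 3)
    .reindex(columns=df.columns, fill_value=False)
    .astype(bool)
)

# Not in place: returns a new DataFrame
masked_copy = df.mask(full_cond)

# In place, on a copy so that df itself stays untouched
masked_inplace = df.copy()
masked_inplace.mask(full_cond, inplace=True)
```

With a full-shape condition, column C should keep 'hi' in both variants, because the value of inplace no longer decides how the condition is completed.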
From the source code, you can see that pandas.DataFrame.mask() calls pandas.DataFrame.where() internally, with the condition inverted. pandas.DataFrame.where() then calls a _where() method that replaces values where the condition is False.
I'll use df.where() as the example. Here is the code:
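As a quick check of the relationship between the two methods, here is a small sketch that compares mask() with where() on the inverted condition. It uses a condition with the same shape as df, so the inplace-dependent filling does not come into play; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])

# Same shape as df, so no part of the condition has to be filled in
cond = df < 3

via_mask = df.mask(cond)     # replace values where cond is True
via_where = df.where(~cond)  # keep values where ~cond is True

print(via_mask.equals(via_where))
```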
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])
# Not in place: returns a new DataFrame
df1 = df.where(df[['A', 'B']] < 3)
# In place: modifies df itself
df.where(df[['A', 'B']] < 3, inplace=True)
In this example, df is
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
df[['A', 'B']] < 3, the value of the cond argument, is
       A      B
0   True   True
1  False  False
2  False  False
3  False  False
Digging into the _where() method, the following lines are the key part:
def _where(...):
    # align the cond to same shape as myself
    cond = com.apply_if_callable(cond, self)
    if isinstance(cond, NDFrame):
        cond, _ = cond.align(self, join="right", broadcast_axis=1)
    ...
    # make sure we are boolean
    fill_value = bool(inplace)
    cond = cond.fillna(fill_value)
Since cond and df have different shapes, cond.align() fills the missing column with NaN. After that, cond looks like
       A      B   C
0   True   True NaN
1  False  False NaN
2  False  False NaN
3  False  False NaN
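The alignment step can be reproduced by hand with the public DataFrame.align() method. This is only a sketch of that one step, not the pandas internals themselves; broadcast_axis is left out because both objects are DataFrames here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])
cond = df[['A', 'B']] < 3

# join="right" keeps the row and column labels of df (the right-hand object)
aligned, _ = cond.align(df, join="right")

print(aligned.columns.tolist())    # ['A', 'B', 'C']
print(aligned['C'].isna().all())   # True
```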
Then cond.fillna(fill_value) replaces the NaN values with bool(inplace), so the C column of cond takes the same value as inplace.
There is more code related to inplace (L9048 and L9124-L9145), but we needn't care about the details, since the aim of those lines is to replace values where the condition is False.
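To see what the fill does for each value of inplace, the sketch below applies fillna() by hand to an aligned condition. It mimics the step described above rather than calling _where() itself; the variable names and the final astype(bool) are additions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])

# Condition on A and B only, aligned to df so that column C is all NaN
aligned, _ = (df[['A', 'B']] < 3).align(df, join="right")

# inplace=False -> fill_value is False -> column C is replaced by `other`
cond_not_inplace = aligned.fillna(False).astype(bool)

# inplace=True -> fill_value is True -> column C is kept
cond_inplace = aligned.fillna(True).astype(bool)

print(cond_not_inplace['C'].tolist())
print(cond_inplace['C'].tolist())
```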
Recall that df is
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
df1 = df.where(df[['A', 'B']] < 3): The C column of cond is False, since the default value of inplace is False. After df.where(), the C column of the result is set to the value of the other argument, which is NaN by default.
df.where(df[['A', 'B']] < 3, inplace=True): The C column of cond is True. After df.where(), the C column of df stays the same.
# print(df1)
     A    B   C
0  0.0  1.0 NaN
1  NaN  NaN NaN
2  NaN  NaN NaN
3  NaN  NaN NaN
# print(df) after df.where(df[['A', 'B']] < 3, inplace=True)
     A    B   C
0  0.0  1.0   2
1  NaN  NaN   5
2  NaN  NaN   8
3  NaN  NaN  11