
Is mask working differently using inplace?

I've just come across this strange behaviour of mask. Could someone explain it to me?

A) [input]

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] = 'hi'
df.mask(df[['A', 'B']]<3, inplace=True)

[output]

     A    B   C
0  NaN  NaN  hi
1  NaN  3.0  hi
2  4.0  5.0  hi
3  6.0  7.0  hi
4  8.0  9.0  hi

B) [input]

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] = 'hi'
df.mask(df[['A', 'B']]<3)

[output]

     A    B    C
0  NaN  NaN  NaN
1  NaN  3.0  NaN
2  4.0  5.0  NaN
3  6.0  7.0  NaN
4  8.0  9.0  NaN

Thank you in advance

JeB asked Mar 04 '21 11:03


1 Answer

The results differ because the boolean DataFrame you pass as the condition does not have the same shape as the DataFrame you want to mask. pandas aligns the condition to the DataFrame, and df.mask() fills the entries that are missing after alignment with the value of inplace.

From the source code, you can see that pandas.DataFrame.mask() calls pandas.DataFrame.where() internally, passing the inverted condition (~cond). pandas.DataFrame.where() then calls a _where() method, which replaces values where the condition is False.
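As a quick check of that relationship, here is a minimal sketch comparing mask() with where() on the inverted condition. The frame and variable names are made up for illustration, and the condition is deliberately given the full shape so that alignment does not come into play:

```python
import numpy as np
import pandas as pd

# Illustrative frame; these names are not taken from the pandas source.
demo = pd.DataFrame(np.arange(6).reshape(-1, 2), columns=["A", "B"])
full_cond = demo < 3  # same shape as demo, so nothing is missing after alignment

masked = demo.mask(full_cond)       # replace values where the condition is True
where_inv = demo.where(~full_cond)  # keep values where the inverted condition is True

# Both calls should give the same frame.
print(masked.equals(where_inv))
```

With a full-shape condition the two calls agree regardless of inplace; the difference in the question only shows up once some entries of the condition are missing.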

I'll use df.where() as the example. Here is the code:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])

df1 = df.where(df[['A', 'B']]<3)

df.where(df[['A', 'B']]<3, inplace=True)

In this example, df is

   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

and df[['A', 'B']]<3, the value of the cond argument, is

       A      B
0   True   True
1  False  False
2  False  False
3  False  False

Digging into the _where() method, the following lines are the key part:

    def _where(...):
        # align the cond to same shape as myself
        cond = com.apply_if_callable(cond, self)
        if isinstance(cond, NDFrame):
            cond, _ = cond.align(self, join="right", broadcast_axis=1)
        ...
        # make sure we are boolean
        fill_value = bool(inplace)
        cond = cond.fillna(fill_value)

Since cond and df have different shapes, cond.align() adds the missing column to cond and fills it with NaN. After that, cond looks like

       A      B   C
0   True   True NaN
1  False  False NaN
2  False  False NaN
3  False  False NaN

Then cond.fillna(fill_value) replaces the NaN values with the value of inplace, so the C column of cond ends up equal to inplace.
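The two steps can be reproduced by hand with public pandas methods. This is only a sketch of what the excerpt does, not the internal code path itself; in particular it leaves out the broadcast_axis argument, which does not matter for this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=["A", "B", "C"])
cond = df[["A", "B"]] < 3

# Align cond to df's labels. cond has no column C, so it is added as NaN.
aligned, _ = cond.align(df, join="right")
print(aligned)

# Mimic the fill step for both possible values of inplace.
for inplace in (False, True):
    filled = aligned.fillna(bool(inplace))
    print("inplace =", inplace, "-> C column of cond:", filled["C"].tolist())
```

The only thing that changes between the two iterations is the C column of the condition, which is exactly the part the original cond did not cover.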

A few other lines in the source (L9048 and L9124-L9145) also deal with inplace, but we needn't care about the details: their purpose is only to replace the values where the condition is False.

Recall that df is

   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
  • df1 = df.where(df[['A', 'B']]<3): the C column of cond is filled with False, because inplace defaults to False. df.where() therefore replaces the C column of the result with the value of the other argument, which is NaN by default.
  • df.where(df[['A', 'B']]<3, inplace=True): the C column of cond is filled with True, so df.where() leaves the C column of df unchanged.

# print(df1)
     A    B   C
0  0.0  1.0 NaN
1  NaN  NaN NaN
2  NaN  NaN NaN
3  NaN  NaN NaN

# print(df) after df.where(df[['A', 'B']]<3, inplace=True)
     A    B   C
0  0.0  1.0   2
1  NaN  NaN   5
2  NaN  NaN   8
3  NaN  NaN  11
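
If the dependence on inplace is not wanted, one option is to expand the condition to the full shape yourself, so that nothing is left for pandas to fill in. The sketch below uses the frame from the question; the reindex() call and the choice of False for the uncovered column are my own assumptions about the intent (leave C untouched), not something taken from the pandas source:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=["A", "B"])
df["C"] = "hi"

# Give the condition every column of df; columns it did not cover get False,
# which for mask() means "do not replace".
cond = (df[["A", "B"]] < 3).reindex(columns=df.columns, fill_value=False)

result = df.mask(cond)              # returns a new frame
result_inplace = df.copy()
result_inplace.mask(cond, inplace=True)  # modifies the copy in place

print(result)
print(result.equals(result_inplace))
```

Because cond now has the same shape as df, both calls mask only the values of A and B that are below 3.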

Ynjxsjmh answered Oct 24 '22 19:10