Masking a pandas DataFrame with a numpy array vs DataFrame

Tags: python, pandas, dataframe, numpy

I want to use a 2D boolean mask to selectively alter some cells in a pandas DataFrame. I noticed that I cannot use a numpy array (successfully) as the mask, but I can use a DataFrame. More frustrating, however, is that I don't get an error with the numpy approach.

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40]})

mask_np = np.array([[True, True],
                    [False, False],
                    [True, False],
                    [False, True]])

mask_pd = pd.DataFrame(mask_np, columns=['A', 'B'])

I would think either mask would return the values from df wherever the mask was True. But instead, df[mask_np] produces

   A   B
0  1  10
0  1  10
2  3  30
3  4  40

which I neither expect nor can explain. On the other hand, df[mask_pd] produces

     A     B
0  1.0  10.0
1  NaN   NaN
2  3.0   NaN
3  NaN  40.0

which is what I expect and want.

Why can't I use the numpy mask? My internet search turned up nothing relevant. Any explanation behind this difference would be greatly appreciated!

[pandas version 0.20.3; Python 3.6.3]

asked Aug 31 '18 22:08

Justin

2 Answers

The source code suggests why. The __getitem__ method, for which [] is syntactic sugar, checks specifically for indexing via a dataframe:

elif isinstance(key, DataFrame):
    return self._getitem_frame(key)

The _getitem_frame method then checks that the key holds only Boolean values and, if so, returns the result of pd.DataFrame.where:

def _getitem_frame(self, key):
    if key.values.size and not is_bool_dtype(key.values):
        raise ValueError('Must pass DataFrame with boolean values only')
    return self.where(key)
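As a quick check of the dispatch described above, the sketch below compares df[mask_pd] with df.where(mask_pd) on the data from the question. The names via_getitem and via_where are only illustrative, and passing index=df.index and columns=df.columns when building the mask is an extra precaution added here so the labels are guaranteed to line up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
mask_np = np.array([[True, True],
                    [False, False],
                    [True, False],
                    [False, True]])

# Wrap the array in a DataFrame that shares df's labels,
# so that [] takes the DataFrame branch of __getitem__.
mask_pd = pd.DataFrame(mask_np, index=df.index, columns=df.columns)

via_getitem = df[mask_pd]      # indexing with a Boolean DataFrame
via_where = df.where(mask_pd)  # explicit call to where

# Both routes should give the same frame, NaN positions included.
print(via_getitem.equals(via_where))
```

If the two frames compare equal, that is consistent with [] simply forwarding a Boolean DataFrame to where.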

The route taken for NumPy arrays, _getitem_array, is different and more convoluted. For some reason, the code treats a NumPy array and a DataFrame holding the same Boolean values differently, rather than giving consistent results for both.

Regular Boolean indexing with a Pandas dataframe is usually applied along an axis, i.e. by rows / axis 0 via df.loc[mask, :] or columns / axis 1 via df.loc[:, mask].
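For comparison, here is a minimal sketch of axis-wise Boolean indexing with 1-dimensional masks. The masks row_mask and col_mask are made up for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

# Hypothetical 1-D masks: one entry per row, one entry per column.
row_mask = np.array([True, False, True, False])
col_mask = np.array([False, True])

rows = df.loc[row_mask, :]  # keeps the rows where row_mask is True
cols = df.loc[:, col_mask]  # keeps the columns where col_mask is True

print(rows)
print(cols)
```

Each mask here selects whole rows or whole columns, which is different from the cell-by-cell selection that a 2D mask passed to where performs.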

Note you can, and probably should, access pd.DataFrame.where directly for clarity:

res = df.where(mask_np)

print(res)

     A     B
0  1.0  10.0
1  NaN   NaN
2  3.0   NaN
3  NaN  40.0
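Since the question is about altering selected cells, a possible follow-up is sketched below using DataFrame.mask, the counterpart of where that replaces values where the condition is True. The replacement value -1 and the variable names kept and altered are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
mask_np = np.array([[True, True],
                    [False, False],
                    [True, False],
                    [False, True]])

# where keeps the cells where the mask is True; the rest become NaN.
kept = df.where(mask_np)

# mask does the opposite: cells where the mask is True are replaced,
# here with the arbitrary placeholder value -1.
altered = df.mask(mask_np, -1)

print(kept)
print(altered)
```

Both methods accept the 2D NumPy array directly, as long as its shape matches that of df.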

answered Oct 20 '22 01:10

jpp

Write down the row index of each True in your mask_np: row 0, row 0, row 2, row 3. Select the rows of df with those indices and concatenate them. That's how df[mask_np] is produced.

This is probably a Pandas bug, since it's assumed in the source code that the array used for indexing is 1-dimensional.

Looking at the source code (Pandas 0.23.4),

df[mask_np]

is equivalent to

df._getitem_bool_array(mask_np)

is equivalent to

indexer = mask_np.nonzero()[0]
df._take(indexer, axis=0)

with the following evaluation:

>>> mask_np.nonzero()
(array([0, 0, 2, 3]), array([0, 1, 0, 1]))

This tuple of arrays holds the indices of the nonzero elements along each dimension of the array. In this case, the elements of the first array in the tuple (eventually used in df._take) are the row indices of the True values in mask_np.

The first array is used to take along the index, so you get rows 0, 0, 2, 3 of df in return.
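The steps above can be reproduced with public API only. This is a sketch that assumes DataFrame.take behaves like the private df._take call quoted from the 0.23.4 source; it does not call df[mask_np] itself, whose behaviour may differ in other pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
mask_np = np.array([[True, True],
                    [False, False],
                    [True, False],
                    [False, True]])

# Row positions of every True in the 2D mask; row 0 appears twice
# because it holds two True values.
indexer = mask_np.nonzero()[0]

# Take those row positions along the index (axis 0).
result = df.take(indexer, axis=0)

print(indexer)
print(result)
```

The repeated row 0 in the result comes from the two True values in the first row of the mask.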

answered Oct 20 '22 00:10

Andrey Portnoy
