I want to use a 2D boolean mask to selectively alter some cells in a pandas DataFrame. I noticed that I cannot successfully use a numpy array as the mask, but I can use a DataFrame. More frustrating, however, is that I don't get an error with the numpy approach.
For example,
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40]})

mask_np = np.array([[True, True],
                    [False, False],
                    [True, False],
                    [False, True]])

mask_pd = pd.DataFrame(mask_np, columns=['A', 'B'])
I would think either mask would return the values from df wherever the mask was True. But instead, df[mask_np] produces
   A   B
0  1  10
0  1  10
2  3  30
3  4  40
which is not what I expect, nor can I explain. On the other hand, df[mask_pd] produces
     A     B
0  1.0  10.0
1  NaN   NaN
2  3.0   NaN
3  NaN  40.0
which is what I expect and want.
Why can't I use the numpy mask? My internet search turned up nothing relevant. Any explanation behind this difference would be greatly appreciated!
[pandas version 0.20.3; Python 3.6.3]
The source code suggests why. The __getitem__ method, for which [] is syntactic sugar, checks specifically for indexing via a dataframe:
elif isinstance(key, DataFrame):
    return self._getitem_frame(key)
The _getitem_frame method that is then called returns the result of pd.DataFrame.where if the dataframe is of Boolean type:
def _getitem_frame(self, key):
    if key.values.size and not is_bool_dtype(key.values):
        raise ValueError('Must pass DataFrame with boolean values only')
    return self.where(key)
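In other words, indexing with a Boolean dataframe is just a thin wrapper around where. A minimal check of this, reusing the df and mask_pd defined in the question, might look like:
# Sketch: indexing with a Boolean DataFrame should match where()
res_getitem = df[mask_pd]
res_where = df.where(mask_pd)
print(res_getitem.equals(res_where))  # expected: True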
The route taken for NumPy arrays, _getitem_array, is different and more convoluted. For some reason, the code is designed to treat NumPy / Pandas inputs differently, rather than to ensure consistency for the same data types.
Regular Boolean indexing with a Pandas dataframe is usually applied along an axis, i.e. by rows / axis 0 via df.loc[mask, :] or by columns / axis 1 via df.loc[:, mask].
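For instance, with 1-D masks (these masks are illustrative and not part of the question):
# Sketch: the usual 1-D Boolean indexing, one mask per axis
row_mask = np.array([True, False, True, False])  # one entry per row
col_mask = np.array([True, False])               # one entry per column
print(df.loc[row_mask, :])  # rows 0 and 2, both columns
print(df.loc[:, col_mask])  # all rows, column 'A' only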
Note you can, and probably should, access pd.DataFrame.where directly for clarity:
res = df.where(mask_np)
print(res)
     A     B
0  1.0  10.0
1  NaN   NaN
2  3.0   NaN
3  NaN  40.0
Write down the row indices of the True's in your mask_np: row 0, row 0, row 2, row 3. Select the rows with those indices in df and concatenate them. That's how df[mask_np] is produced.
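A quick way to see this, reusing the df and mask_np from the question, is to select those row positions directly (a sketch, not the internal implementation):
# Sketch: df[mask_np] behaves like taking rows 0, 0, 2, 3 by position
print(df.iloc[[0, 0, 2, 3]])
#    A   B
# 0  1  10
# 0  1  10
# 2  3  30
# 3  4  40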
This is probably a Pandas bug, since the source code assumes that the array used for indexing is 1-dimensional.
Looking at the source code (Pandas 0.23.4), df[mask_np] is equivalent to
df._getitem_bool_array(mask_np)
which in turn is equivalent to
indexer = mask_np.nonzero()[0]
df._take(indexer, axis=0)
with the following evaluation:
>>> mask_np.nonzero()
(array([0, 0, 2, 3]), array([0, 1, 0, 1]))
This tuple of arrays represents the indices of nonzero elements along each dimension of the array. In this case, the elements of the first array in the tuple (eventually used in df._take) are the 'row' indices of the True's in mask_np.
The first array is used to take along the index, so you get rows 0, 0, 2, 3 of df in return.
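You can reproduce the same result through the public take method (a sketch; take is the public counterpart of the private _take used internally):
# Sketch: reproduce df[mask_np] via nonzero() + take()
indexer = mask_np.nonzero()[0]        # array([0, 0, 2, 3]), row positions of the True values
reconstructed = df.take(indexer, axis=0)
print(reconstructed.equals(df[mask_np]))  # expected: True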