
Masking a pandas DataFrame with a numpy array vs DataFrame

I want to use a 2D boolean mask to selectively alter some cells in a pandas DataFrame. I noticed that I cannot use a numpy array (successfully) as the mask, but I can use a DataFrame. More frustrating, however, is that I don't get an error with the numpy approach.

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40]})

mask_np = np.array([[True,True],
                    [False,False],
                    [True,False],
                    [False,True]])

mask_pd = pd.DataFrame(mask_np, columns=['A','B'])

I would think either mask would return the values from df wherever the mask was True. But instead, df[mask_np] produces

   A   B
0  1  10
0  1  10
2  3  30
3  4  40

which is not what I expect, nor can I explain it. On the other hand, df[mask_pd] produces

     A     B
0  1.0  10.0
1  NaN   NaN
2  3.0   NaN
3  NaN  40.0

which is what I expect and want.

Why can't I use the numpy mask? My internet search turned up nothing relevant. Any explanation behind this difference would be greatly appreciated!

[pandas version 0.20.3; Python 3.6.3]

Justin asked Aug 31 '18 22:08



2 Answers

The source code suggests why. The __getitem__ method, for which [] is syntactic sugar, checks specifically for indexing via a dataframe:

elif isinstance(key, DataFrame):
    return self._getitem_frame(key)

The _getitem_frame method then returns the result of pd.DataFrame.where, provided the DataFrame key holds only Boolean values:

def _getitem_frame(self, key):
    if key.values.size and not is_bool_dtype(key.values):
        raise ValueError('Must pass DataFrame with boolean values only')
    return self.where(key)
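As a quick check of the claim above, the following sketch compares indexing with a Boolean DataFrame against calling where directly. The variable names via_getitem, via_where and same are just for illustration, and the data is the example from the question.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40]})

# Boolean DataFrame mask with the same index and columns as df
mask_pd = pd.DataFrame([[True, True],
                        [False, False],
                        [True, False],
                        [False, True]], columns=['A', 'B'])

via_getitem = df[mask_pd]      # goes through the DataFrame branch of __getitem__
via_where = df.where(mask_pd)  # explicit call

# Both keep values where the mask is True and put NaN elsewhere
same = via_getitem.equals(via_where)
print(same)
```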

The route taken for NumPy arrays, _getitem_array, is different and more convoluted. For some reason, the code treats NumPy and Pandas inputs differently instead of handling the same Boolean data consistently.


Regular Boolean indexing with a Pandas dataframe is usually applied along an axis, i.e. by rows / axis 0 via df.loc[mask, :] or columns / axis 1 via df.loc[:, mask].
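A minimal sketch of the axis-wise indexing described above, using the question's data. The masks row_mask and col_mask are made up for illustration; any 1-dimensional Boolean mask of the right length should behave the same way.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40]})

# 1-D mask over rows (axis 0): one Boolean per row
row_mask = df['A'] > 2
rows = df.loc[row_mask, :]

# 1-D mask over columns (axis 1): one Boolean per column
col_mask = df.columns == 'B'
cols = df.loc[:, col_mask]

print(rows)
print(cols)
```

In both cases the mask selects whole rows or whole columns, which is different from the cell-by-cell selection that where performs.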

Note you can, and probably should, access pd.DataFrame.where directly for clarity:

res = df.where(mask_np)

print(res)

     A     B
0  1.0  10.0
1  NaN   NaN
2  3.0   NaN
3  NaN  40.0
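Since the question is about altering selected cells, here is a hedged sketch of one way to do that with the NumPy mask: DataFrame.mask replaces values where the condition is True. The replacement value 0 and the name altered are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40]})

mask_np = np.array([[True, True],
                    [False, False],
                    [True, False],
                    [False, True]])

# Replace the cells where mask_np is True with 0; other cells are kept.
# mask() returns a new DataFrame and leaves df untouched by default.
altered = df.mask(mask_np, other=0)
print(altered)
```

df.where(~mask_np, other=0) should give the same result, since mask is the inverse of where.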

jpp answered Oct 20 '22 01:10


Write down the row indices of the True values in your mask_np: row 0, row 0, row 2, row 3. Select the rows with the same indices in df and concatenate them. That's how df[mask_np] is produced.

This is probably a Pandas bug, since it's assumed in the source code that the array used for indexing is 1-dimensional.


Looking at the source code (Pandas 0.23.4),

df[mask_np]

is equivalent to

df._getitem_bool_array(mask_np)

is equivalent to

indexer = mask_np.nonzero()[0]
df._take(indexer, axis=0)

with the following evaluation:

>>> mask_np.nonzero()
(array([0, 0, 2, 3]), array([0, 1, 0, 1]))

This tuple of arrays represents the indices of the nonzero elements along each dimension of the array. In this case, the elements of the first array in the tuple (eventually used in df._take) are the 'row' indices of the True values in mask_np.

The first array is used to take along the index, so you get rows 0, 0, 2, 3 of df in return.
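The steps above can be reproduced by hand without relying on how df[mask_np] behaves in any particular Pandas version. This is only a sketch: it uses the public df.iloc as a stand-in for the private df._take, and the names row_idx, col_idx, picked and row_any are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40]})

mask_np = np.array([[True, True],
                    [False, False],
                    [True, False],
                    [False, True]])

# nonzero() on a 2-D array returns one index array per dimension
row_idx, col_idx = mask_np.nonzero()

# Only the row indices are used, so row 0 appears twice
picked = df.iloc[row_idx]
print(picked)

# If whole rows are what you want, collapse the mask to 1-D first
row_any = mask_np.any(axis=1)
print(df[row_any])
```

The column indices in col_idx are simply discarded, which is why the result contains whole rows rather than individual cells.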

Andrey Portnoy answered Oct 20 '22 00:10