Why are Numpy masked arrays useful?

Tags: python, numpy, masked-array

I've been reading through the masked array documentation and I'm confused: how is a MaskedArray different from simply maintaining an array of values alongside a boolean mask? Can someone give me an example where MaskedArrays are much more convenient or perform better?

Update 6/5

To be more concrete about my question, here is the classic example of how one uses a MaskedArray:

>>> data = np.arange(12).reshape(3, 4)
>>> mask = np.array([[0., 0., 1., 0.],
...                  [0., 0., 0., 1.],
...                  [0., 1., 0., 0.]])

>>> masked = np.ma.array(data, mask=mask)
>>> masked

masked_array(
  data=[[0, 1, --, 3],
        [4, 5, 6, --],
        [8, --, 10, 11]],
  mask=[[False, False,  True, False],
        [False, False, False,  True],
        [False,  True, False, False]],
  fill_value=999999)

>>> masked.sum(axis=0)

masked_array(data=[12, 6, 16, 14], mask=[False, False, False, False], fill_value=999999)

I could just as easily do the same thing this way:

>>> data = np.arange(12).reshape(3, 4).astype(float)
>>> mask = np.array([[0., 0., 1., 0.],
...                  [0., 0., 0., 1.],
...                  [0., 1., 0., 0.]]).astype(bool)

>>> masked = data.copy()  # this keeps the original data reusable, as would
...                       # the MaskedArray. If we only need to perform one
...                       # operation then we could avoid the copy
>>> masked[mask] = np.nan
>>> np.nansum(masked, axis=0)

array([12.,  6., 16., 14.])

I suppose the MaskedArray version looks a bit nicer and avoids the copy if you need a reusable array. But doesn't it use just as much memory when converting from a standard ndarray to a MaskedArray? Does it avoid the copy under the hood when applying the mask to the data? Are there other advantages?

asked May 04 '19 23:05

RedPanda

1 Answer

The official answer was reported here:

In theory, IEEE nan was specifically designed to address the problem of missing values, but the reality is that different platforms behave differently, making life more difficult. On some platforms, the presence of nan slows calculations 10-100 times. For integer data, no nan value exists.
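The point about integer data can be illustrated with a minimal sketch. The array `counts` and the sentinel value `-1` are made up for this example, and the snippet assumes a reasonably recent NumPy:

```python
import numpy as np

# Hypothetical integer data where -1 stands for "missing".
counts = np.array([3, 7, -1, 5])

# An integer array cannot store NaN: the assignment raises ValueError.
try:
    counts.copy()[2] = np.nan
    nan_assignment_failed = False
except ValueError:
    nan_assignment_failed = True

# A masked array hides the sentinel and keeps the integer dtype.
masked_counts = np.ma.masked_equal(counts, -1)
total = masked_counts.sum()  # the masked entry is skipped

print(nan_assignment_failed, masked_counts.dtype.kind, total)
```

With the NaN approach, the data would first have to be converted to float, which changes the dtype and can lose precision for large integers.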

In fact, masked arrays can be quite slow compared to the analogous array of nans:

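import numpy as np

g = np.random.random((5000, 5000))
# 500 random (row, col) pairs; the upper bound of randint is exclusive
indx = np.random.randint(0, 5000, (500, 2))
rows, cols = indx[:, 0], indx[:, 1]

g_nan = g.copy()
g_nan[rows, cols] = np.nan  # g_nan[indx] would overwrite whole rows

mask = np.full((5000, 5000), False, dtype=bool)
mask[rows, cols] = True
g_mask = np.ma.array(g, mask=mask)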

%timeit (g_mask + g_mask)**2
# 1.27 s ± 35.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit (g_nan + g_nan)**2
# 76.5 ms ± 715 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

When are they useful?

In many years of programming, I have found them useful in the following situations:

When you want to preserve the values you masked for later processing, without copying the array.
When you don't want to get tricked by the strange behaviour of nan operations (although the behaviour of masked arrays can trick you too).
When you have to handle many arrays together with their masks: if the mask is part of the array, you avoid extra code and confusion.
When you want the masked value to mean something different from nan. For instance, I use np.nan for missing values, but I also mask values with poor SNR, so I can identify both.
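As a rough sketch of the first and last points above, the following uses made-up `signal` and `snr` arrays and an arbitrary SNR threshold of 1.0. It relies on `np.ma.array` not copying its input by default (`copy=False`):

```python
import numpy as np

# Hypothetical data: NaN marks a missing sample, snr is a per-sample quality score.
signal = np.array([1.0, np.nan, 3.0, 4.0, 5.0])
snr = np.array([9.0, 8.0, 0.5, 7.0, 0.2])

low_snr = snr < 1.0                      # arbitrary threshold for this sketch
m = np.ma.array(signal, mask=low_snr)    # copy=False by default

# The masked array wraps the original buffer instead of duplicating it,
# and the masked values are still available through .data.
shares = np.shares_memory(m.data, signal)
hidden_value = m.data[2]

# The two kinds of "bad" samples stay distinguishable.
missing = np.isnan(m.data)               # NaN: never measured
rejected = np.ma.getmaskarray(m)         # masked: measured, but poor SNR

# Statistics over samples that are neither missing nor rejected.
good = np.ma.array(signal, mask=missing | rejected)
print(shares, hidden_value, good.mean())
```

Only the boolean mask costs additional memory here, one byte per element, while the NaN approach in the question needs a full copy of the data to keep the original values.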

In general, you can think of a masked array as a more compact representation: the data and its mask travel together in one object. The best approach is to test, case by case, which solution is more comprehensible and efficient.
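One way to run such a case-by-case test outside IPython is sketched below with the standard `timeit` module. The array size, the number of masked points and the repeat count are arbitrary choices for illustration, and the timings will vary by machine and NumPy version:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
g = rng.random((500, 500))
rows = rng.integers(0, 500, size=50)
cols = rng.integers(0, 500, size=50)

# NaN-based version: needs a float copy of the data.
g_nan = g.copy()
g_nan[rows, cols] = np.nan

# Masked version: same points hidden by a boolean mask.
mask = np.zeros(g.shape, dtype=bool)
mask[rows, cols] = True
g_mask = np.ma.array(g, mask=mask)

# Time the same expression on both representations.
t_mask = timeit.timeit(lambda: (g_mask + g_mask) ** 2, number=20)
t_nan = timeit.timeit(lambda: (g_nan + g_nan) ** 2, number=20)
print(f"masked: {t_mask:.4f} s, nan: {t_nan:.4f} s")

# Check that both approaches agree before comparing their speed.
res_mask = ((g_mask + g_mask) ** 2).filled(np.nan)
res_nan = (g_nan + g_nan) ** 2
same = np.allclose(res_mask, res_nan, equal_nan=True)
print(same)
```

Checking that the two results match first avoids comparing the speed of computations that are not actually equivalent.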

answered Oct 04 '22 16:10

G M
