One of the things I deal with most in data cleaning is missing values. R handles this well with its NA missing-data label. In Python, it appears I'll have to deal with masked arrays, which seem to be a major pain to set up and don't seem to be well documented. Any suggestions on making this process easier in Python? This is becoming a deal-breaker in moving to Python for data analysis. Thanks
Update: It's obviously been a while since I've looked at the methods in the numpy.ma module. It appears that at least the basic analysis functions are available for masked arrays, and the examples provided helped me understand how to create them (thanks to the authors). I would like to see whether some of the newer statistical tools for Python (being developed in this year's GSoC) incorporate this aspect, and at least handle complete-case analysis.
Working With Missing Values
NumPy will gain a global singleton called numpy.NA, similar to None, but with semantics reflecting its status as a missing value. In particular, trying to treat it as a boolean will raise an exception, and comparisons with it will produce numpy.NA instead of True or False.
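For what it's worth, released NumPy never gained numpy.NA, but pandas (1.0 and later) ships a pd.NA singleton with essentially the semantics described above; a minimal illustration:

import pandas as pd

# Comparisons and arithmetic with pd.NA propagate the missing value
# instead of returning True/False or a number.
print(pd.NA == 1)  # <NA>
print(pd.NA + 5)   # <NA>

# Treating pd.NA as a boolean raises rather than silently picking a branch.
try:
    if pd.NA:
        pass
except TypeError as exc:
    print(exc)  # boolean value of NA is ambiguous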
In NumPy, to replace NaN (np.nan) missing values in an ndarray with other numbers, use np.nan_to_num(), or locate them with np.isnan() and assign a fill value yourself.
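A quick demonstration of both approaches (the fill values here are arbitrary):

import numpy as np

data = np.array([1.0, np.nan, 3.0, np.nan])

# np.nan_to_num replaces NaN with 0.0 by default.
print(np.nan_to_num(data))  # [1. 0. 3. 0.]

# Or find the NaNs with np.isnan and assign your own fill value.
filled = data.copy()
filled[np.isnan(filled)] = -1.0
print(filled)  # [ 1. -1.  3. -1.]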
In pandas, missing data is represented by two values. None is a Python singleton object that is often used for missing data in Python code. NaN (short for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE 754 floating-point representation.
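A short sketch showing that pandas treats both the same way (the values are made up):

import numpy as np
import pandas as pd

s = pd.Series([1.0, None, np.nan, 4.0])

# In a float Series, None is converted to NaN; both register as missing.
print(s.isna())    # False, True, True, False
print(s.dropna())  # the complete cases: 1.0 and 4.0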
If you are willing to consider a library, pandas (http://pandas.pydata.org/) is built on top of NumPy and, amongst many other things, provides:
Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form
I've been using it for almost one year in the financial industry where missing and badly aligned data is the norm and it really made my life easier.
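A minimal sketch of the alignment behavior mentioned above (the labels and numbers are invented for illustration):

import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Addition aligns on the index labels; "x" exists only in a,
# so the result there is automatically NaN.
print(a + b)  # x: NaN, y: 12.0, z: 23.0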
I'd also question the claim that masked arrays are a pain. Here are a couple of examples:
import numpy as np
data = np.ma.masked_array(np.arange(10))
data[5] = np.ma.masked # Mask a specific value
data[data > 6] = np.ma.masked # Mask any value greater than 6
# Same thing done at initialization time
init_data = np.arange(10)
data = np.ma.masked_array(init_data, mask=(init_data > 6))
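The masked-array statistics then skip the masked entries automatically, which gives you complete-case results for free (the outputs assume the array built just above):

# Analysis functions on masked arrays ignore masked entries, so this is a
# complete-case mean/sum over the unmasked values 0 through 6.
print(data.mean())  # 3.0
print(data.sum())   # 21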
Masked arrays are the answer, as DpplerShift describes. For quick and dirty use, you can use fancy indexing with boolean arrays:
>>> import numpy as np
>>> data = np.arange(10)
>>> valid_idx = data % 2 == 0 # pretend that odd elements are missing
>>> # Get non-missing data
>>> data[valid_idx]
array([0, 2, 4, 6, 8])
You can now use valid_idx as a quick mask on other data as well:
>>> comparison = np.arange(10) + 10
>>> comparison[valid_idx]
array([10, 12, 14, 16, 18])
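The same trick handles real missing values stored as NaN, which gives you a quick complete-case analysis (the data here are made up):

>>> raw = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
>>> valid = ~np.isnan(raw)  # boolean mask of the non-missing entries
>>> raw[valid]
array([1., 3., 5.])
>>> raw[valid].mean()
3.0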