One of the things I deal with most in data cleaning is missing values. R handles this well with its NA missing-data label. In Python, it appears I'll have to deal with masked arrays, which seem to be a major pain to set up and don't seem to be well documented. Any suggestions on making this process easier in Python? This is becoming a deal-breaker in moving to Python for data analysis. Thanks
Update: It's obviously been a while since I've looked at the methods in the numpy.ma module. It appears that at least the basic analysis functions are available for masked arrays, and the examples provided helped me understand how to create them (thanks to the authors). I would like to see whether some of the newer statistical tools for Python (being developed in this year's GSoC) incorporate this aspect, and at least handle complete-case analysis.
Working With Missing Values
NumPy will gain a global singleton called numpy.NA, similar to None, but with semantics reflecting its status as a missing value. In particular, trying to treat it as a boolean will raise an exception, and comparisons with it will produce numpy.NA instead of True or False.
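For what it's worth, released NumPy never gained numpy.NA, but pandas (1.0 and later) ships a pd.NA singleton with essentially the semantics described above; a minimal illustration:

import pandas as pd

# Comparisons and arithmetic with pd.NA propagate the missing value
# instead of returning True/False or a number.
print(pd.NA == 1)  # <NA>
print(pd.NA + 5)   # <NA>

# Treating pd.NA as a boolean raises rather than silently picking a branch.
try:
    if pd.NA:
        pass
except TypeError as exc:
    print(exc)  # boolean value of NA is ambiguous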
In NumPy, to replace NaN (np.nan) missing values in an ndarray with other numbers, use np.nan_to_num(), or locate them with np.isnan() and assign a fill value yourself.
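A quick demonstration of both approaches (the fill values here are arbitrary):

import numpy as np

data = np.array([1.0, np.nan, 3.0, np.nan])

# np.nan_to_num replaces NaN with 0.0 by default.
print(np.nan_to_num(data))  # [1. 0. 3. 0.]

# Or find the NaNs with np.isnan and assign your own fill value.
filled = data.copy()
filled[np.isnan(filled)] = -1.0
print(filled)  # [ 1. -1.  3. -1.]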
In pandas, missing data is represented by two values. None is a Python singleton object that is often used for missing data in Python code. NaN (short for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE 754 floating-point representation.
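A short sketch showing that pandas treats both the same way (the values are made up):

import numpy as np
import pandas as pd

s = pd.Series([1.0, None, np.nan, 4.0])

# In a float Series, None is converted to NaN; both register as missing.
print(s.isna())    # False, True, True, False
print(s.dropna())  # the complete cases: 1.0 and 4.0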
If you are willing to consider a library, pandas (http://pandas.pydata.org/) is built on top of NumPy and, amongst many other things, provides:
Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form
I've been using it for almost one year in the financial industry where missing and badly aligned data is the norm and it really made my life easier.
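A minimal sketch of the alignment behavior mentioned above (the labels and numbers are invented for illustration):

import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Addition aligns on the index labels; "x" exists only in a,
# so the result there is automatically NaN.
print(a + b)  # x: NaN, y: 12.0, z: 23.0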
I'd also question the claim that masked arrays are a pain. Here are a couple of examples:
import numpy as np
data = np.ma.masked_array(np.arange(10))
data[5] = np.ma.masked # Mask a specific value
data[data > 6] = np.ma.masked # Mask any value greater than 6
# Same thing done at initialization time
init_data = np.arange(10)
data = np.ma.masked_array(init_data, mask=(init_data > 6))
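The masked-array statistics then skip the masked entries automatically, which gives you complete-case results for free (the outputs assume the array built just above):

# Analysis functions on masked arrays ignore masked entries, so this is a
# complete-case mean/sum over the unmasked values 0 through 6.
print(data.mean())  # 3.0
print(data.sum())   # 21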
Masked arrays are the answer, as DpplerShift describes. For quick and dirty use, you can use fancy indexing with boolean arrays:
>>> import numpy as np
>>> data = np.arange(10)
>>> valid_idx = data % 2 == 0 # pretend that odd elements are missing
>>> # Get non-missing data
>>> data[valid_idx]
array([0, 2, 4, 6, 8])
You can now use valid_idx as a quick mask on other data as well:
>>> comparison = np.arange(10) + 10
>>> comparison[valid_idx]
array([10, 12, 14, 16, 18])
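The same trick handles real missing values stored as NaN, which gives you a quick complete-case analysis (the data here are made up):

>>> raw = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
>>> valid = ~np.isnan(raw)  # boolean mask of the non-missing entries
>>> raw[valid]
array([1., 3., 5.])
>>> raw[valid].mean()
3.0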