Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do you deal with missing data using numpy/scipy?

One of the things I deal with most in data cleaning is missing values. R deals with this well using its "NA" missing data label. In python, it appears that I'll have to deal with masked arrays which seem to be a major pain to set up and don't seem to be well documented. Any suggestions on making this process easier in Python? This is becoming a deal-breaker in moving into Python for data analysis. Thanks

Update It's obviously been a while since I've looked at the methods in the numpy.ma module. It appears that at least the basic analysis functions are available for masked arrays, and the examples provided helped me understand how to create masked arrays (thanks to the authors). I would like to see if some of the newer statistical methods in Python (being developed in this year's GSoC) incorporates this aspect, and at least does the complete case analysis.

like image 968
Abhijit Avatar asked Sep 04 '09 03:09

Abhijit


People also ask

Can Numpy handle missing data?

Working With Missing ValuesNumPy will gain a global singleton called numpy.NA, similar to None, but with semantics reflecting its status as a missing value. In particular, trying to treat it as a boolean will raise an exception, and comparisons with it will produce numpy.NA instead of True or False.

How do I fill missing values in Numpy?

In NumPy, to replace missing values NaN ( np. nan ) in ndarray with other numbers, use np. nan_to_num() or np. isnan() .

How do you handle missing data in Python?

In Pandas missing data is represented by two value: None: None is a Python singleton object that is often used for missing data in Python code. NaN : NaN (an acronym for Not a Number), is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.


3 Answers

If you are willing to consider a library, pandas (http://pandas.pydata.org/) is a library built on top of numpy which amongst many other things provides:

Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form

I've been using it for almost one year in the financial industry where missing and badly aligned data is the norm and it really made my life easier.

like image 139
aristotle Avatar answered Oct 28 '22 22:10

aristotle


I also question the problem with masked arrays. Here are a couple of examples:

import numpy as np
data = np.ma.masked_array(np.arange(10))
data[5] = np.ma.masked # Mask a specific value

data[data>6] = np.ma.masked # Mask any value greater than 6

# Same thing done at initialization time
init_data = np.arange(10)
data = np.ma.masked_array(init_data, mask=(init_data > 6))
like image 34
DopplerShift Avatar answered Oct 28 '22 21:10

DopplerShift


Masked arrays are the anwswer, as DpplerShift describes. For quick and dirty use, you can use fancy indexing with boolean arrays:

>>> import numpy as np
>>> data = np.arange(10)
>>> valid_idx = data % 2 == 0 #pretend that even elements are missing

>>> # Get non-missing data
>>> data[valid_idx]
array([0, 2, 4, 6, 8])

You can now use valid_idx as a quick mask on other data as well

>>> comparison = np.arange(10) + 10
>>> comparison[valid_idx]
array([10, 12, 14, 16, 18])
like image 33
Barry Wark Avatar answered Oct 28 '22 22:10

Barry Wark