
Imputing missing values using a predictive model

I am trying to impute missing values in Python, and sklearn does not appear to offer anything beyond average-based imputation (mean, median, or mode). Orange's imputation model seems to provide a viable option. However, it appears that either Orange.data.Table is not recognizing np.nan or the imputation itself is failing.
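For reference, sklearn's built-in imputation really is limited to per-column statistics. A minimal sketch using the modern SimpleImputer API (which post-dates this question; the 2016-era equivalent was sklearn.preprocessing.Imputer):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same data as below, transposed so each column is a feature
X = np.array([[1, 2, np.nan, 5, 8, np.nan],
              [40, 4, 8, 1, 0.2, 9]]).T

# SimpleImputer only supports column statistics: mean, median,
# most_frequent, or a constant -- no predictive model
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled[:, 0])  # NaNs replaced by the column mean, (1+2+5+8)/4 = 4.0
```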

import Orange
import numpy as np

tmp = np.array([[1, 2, np.nan, 5, 8, np.nan], [40, 4, 8, 1, 0.2, 9]])
data = Orange.data.Table(tmp)
imputer = Orange.feature.imputation.ModelConstructor()
imputer.learner_continuous = Orange.classification.tree.TreeLearner(min_subset=20)
imputer = imputer(data)
impdata = imputer(data)
for i in range(0, len(tmp)):
    print impdata[i]

Output is

[1.000, 2.000, 1.#QO, 5.000, 8.000, 1.#QO]
[40.000, 4.000, 8.000, 1.000, 0.200, 9.000]

Any idea what I am missing? Thanks!

asked Sep 04 '16 by sedeh


1 Answer

It seems the issue is that a missing value in Orange is represented as ? or ~. Oddly enough, the Orange.data.Table(numpy.ndarray) constructor does not convert numpy.nan to ? or ~; it renders it as 1.#QO instead. The custom function below, pandas_to_orange(), works around this problem.

import Orange
import numpy as np
import pandas as pd

from collections import OrderedDict

# Adapted from https://github.com/biolab/orange3/issues/68

def construct_domain(df):
    # Map each pandas column dtype to an Orange feature type
    columns = OrderedDict(df.dtypes)

    def create_variable(col):
        name, dtype = col
        if str(dtype).startswith('float'):
            return Orange.feature.Continuous(name)
        if str(dtype).startswith('int') and len(df[name].unique()) > 50:
            return Orange.feature.Continuous(name)
        if str(dtype).startswith('date'):
            df[name] = df[name].values.astype(str)
        if str(dtype) == 'object':
            df[name] = df[name].astype(type(""))
        return Orange.feature.Discrete(name, values=df[name].unique().tolist())

    return Orange.data.Domain(list(map(create_variable, columns.items())))

def pandas_to_orange(df):
    domain = construct_domain(df)
    df[pd.isnull(df)] = '?'  # Orange's marker for a missing value
    return Orange.data.Table(Orange.data.Domain(domain), df.values.tolist())

df = pd.DataFrame({'col1':[1, 2, np.nan, 4, 5, 6, 7, 8, 9, np.nan, 11], 
                    'col2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110.]}) 

tmp = pandas_to_orange(df)
for i in range(0, len(tmp)):
    print tmp[i]

The output is:

[1.000, 10.000]
[2.000, 20.000]
[?, 30.000]
[4.000, 40.000]
[5.000, 50.000]
[6.000, 60.000]
[7.000, 70.000]
[8.000, 80.000]
[9.000, 90.000]
[?, 100.000]
[11.000, 110.000]

The reason I wanted to properly encode the missing values is so I can use the Orange imputation library. However, it appears that the predictive tree model in the library does not do much more than simple mean imputation here: with min_subset=20 and only 11 rows, the tree never splits, so every missing value gets the same imputed value, the overall mean of the nine observed col1 values (53/9 ≈ 5.889).

imputer = Orange.feature.imputation.ModelConstructor()
imputer.learner_continuous = Orange.classification.tree.TreeLearner(min_subset=20)
imputer = imputer(tmp)
impdata = imputer(tmp)
for i in range(0, len(tmp)):
    print impdata[i]

Here's the output:

[1.000, 10.000]
[2.000, 20.000]
[5.889, 30.000]
[4.000, 40.000]
[5.000, 50.000]
[6.000, 60.000]
[7.000, 70.000]
[8.000, 80.000]
[9.000, 90.000]
[5.889, 100.000]
[11.000, 110.000]

I was looking for something that fits a model, say kNN, on the complete cases and uses the fitted model to predict the missing cases. fancyimpute (a Python 3 package) does this but throws a MemoryError on my 300K+ row input.

from fancyimpute import KNN

df = pd.DataFrame({'col1':[1, 2, np.nan, 4, 5, 6, 7, 8, 9, np.nan, 11], 
                    'col2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110.]}) 

X_filled_knn = KNN(k=3).complete(df)
X_filled_knn

Output is:

array([[   1.        ,   10.        ],
       [   2.        ,   20.        ],
       [   2.77777784,   30.        ],
       [   4.        ,   40.        ],
       [   5.        ,   50.        ],
       [   6.        ,   60.        ],
       [   7.        ,   70.        ],
       [   8.        ,   80.        ],
       [   9.        ,   90.        ],
       [   9.77777798,  100.        ],
       [  11.        ,  110.        ]])

I can probably find a workaround or split the dataset into chunks (not ideal).
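As a footnote for later readers: scikit-learn eventually grew exactly this kind of model-based imputation. The sketch below assumes a reasonably recent sklearn is installed (IterativeImputer arrived in 0.21 and is still flagged experimental); it fits a regressor on the observed rows and predicts the missing entries, so on this toy data it recovers values close to the true 3 and 10 rather than one shared mean:

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be explicitly enabled
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({'col1': [1, 2, np.nan, 4, 5, 6, 7, 8, 9, np.nan, 11],
                   'col2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110.]})

# Fits a regression model (BayesianRidge by default) on the complete
# cases and predicts each missing entry from the other columns
imputer = IterativeImputer(random_state=0)
filled = imputer.fit_transform(df)
print(filled[2], filled[9])  # roughly [3., 30.] and [10., 100.]
```

Because col1 is an exact linear function of col2 in this example, the regression-based imputations land near the true values, unlike the constant 5.889 produced above.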

answered Sep 23 '22 by sedeh