
Imputing missing values using a predictive model

I am trying to impute missing values in Python, and sklearn does not appear to offer anything beyond average-based imputation (mean, median, or mode). Orange's imputation model seems to provide a viable option. However, it appears that either Orange.data.Table is not recognizing np.nan or the imputation itself is failing.
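For reference, sklearn's built-in imputation really is limited to per-column statistics. A minimal sketch using the modern SimpleImputer API (which post-dates this question; the 2016-era equivalent was sklearn.preprocessing.Imputer):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same data as below, transposed so each column is a feature
X = np.array([[1, 2, np.nan, 5, 8, np.nan],
              [40, 4, 8, 1, 0.2, 9]]).T

# SimpleImputer only supports column statistics: mean, median,
# most_frequent, or a constant -- no predictive model
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled[:, 0])  # NaNs replaced by the column mean, (1+2+5+8)/4 = 4.0
```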

import Orange
import numpy as np

tmp = np.array([[1, 2, np.nan, 5, 8, np.nan], [40, 4, 8, 1, 0.2, 9]])
data = Orange.data.Table(tmp)
imputer = Orange.feature.imputation.ModelConstructor()
imputer.learner_continuous = Orange.classification.tree.TreeLearner(min_subset=20)
imputer = imputer(data)
impdata = imputer(data)
for i in range(0, len(tmp)):
    print impdata[i]

Output is

[1.000, 2.000, 1.#QO, 5.000, 8.000, 1.#QO]
[40.000, 4.000, 8.000, 1.000, 0.200, 9.000]

Any idea what I am missing? Thanks!

asked Sep 04 '16 by sedeh


1 Answer

It seems the issue is that a missing value in Orange is represented as ? or ~. Oddly enough, the Orange.data.Table(numpy.ndarray) constructor does not convert numpy.nan to ? or ~; it renders it as 1.#QO instead. The custom function below, pandas_to_orange(), works around this problem.

import Orange
import numpy as np
import pandas as pd

from collections import OrderedDict

# Adapted from https://github.com/biolab/orange3/issues/68

def construct_domain(df):
    # Map each pandas column dtype to an Orange feature type
    columns = OrderedDict(df.dtypes)

    def create_variable(col):
        name, dtype = col
        if str(dtype).startswith('float'):
            return Orange.feature.Continuous(name)
        if str(dtype).startswith('int') and len(df[name].unique()) > 50:
            return Orange.feature.Continuous(name)
        if str(dtype).startswith('date'):
            df[name] = df[name].values.astype(str)
        if str(dtype) == 'object':
            df[name] = df[name].astype(type(""))
        return Orange.feature.Discrete(name, values=df[name].unique().tolist())

    return Orange.data.Domain(list(map(create_variable, columns.items())))

def pandas_to_orange(df):
    domain = construct_domain(df)
    df[pd.isnull(df)] = '?'  # Orange's marker for a missing value
    return Orange.data.Table(Orange.data.Domain(domain), df.values.tolist())

df = pd.DataFrame({'col1':[1, 2, np.nan, 4, 5, 6, 7, 8, 9, np.nan, 11], 
                    'col2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110.]}) 

tmp = pandas_to_orange(df)
for i in range(0, len(tmp)):
    print tmp[i]

The output is:

[1.000, 10.000]
[2.000, 20.000]
[?, 30.000]
[4.000, 40.000]
[5.000, 50.000]
[6.000, 60.000]
[7.000, 70.000]
[8.000, 80.000]
[9.000, 90.000]
[?, 100.000]
[11.000, 110.000]

The reason I wanted to properly encode the missing values is so I can use the Orange imputation library. However, it appears that the predictive tree model in the library does not do much more than simple mean imputation here: with min_subset=20 and only 11 rows, the tree never splits, so every missing value gets the same imputed value, the overall mean of the nine observed col1 values (53/9 ≈ 5.889).

imputer = Orange.feature.imputation.ModelConstructor()
imputer.learner_continuous = Orange.classification.tree.TreeLearner(min_subset=20)
imputer = imputer(tmp)
impdata = imputer(tmp)
for i in range(0, len(tmp)):
    print impdata[i]

Here's the output:

[1.000, 10.000]
[2.000, 20.000]
[5.889, 30.000]
[4.000, 40.000]
[5.000, 50.000]
[6.000, 60.000]
[7.000, 70.000]
[8.000, 80.000]
[9.000, 90.000]
[5.889, 100.000]
[11.000, 110.000]

I was looking for something that fits a model, say kNN, on the complete cases and uses the fitted model to predict the missing cases. fancyimpute (a Python 3 package) does this but throws a MemoryError on my 300K+ row input.

from fancyimpute import KNN

df = pd.DataFrame({'col1':[1, 2, np.nan, 4, 5, 6, 7, 8, 9, np.nan, 11], 
                    'col2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110.]}) 

X_filled_knn = KNN(k=3).complete(df)
X_filled_knn

Output is:

array([[   1.        ,   10.        ],
       [   2.        ,   20.        ],
       [   2.77777784,   30.        ],
       [   4.        ,   40.        ],
       [   5.        ,   50.        ],
       [   6.        ,   60.        ],
       [   7.        ,   70.        ],
       [   8.        ,   80.        ],
       [   9.        ,   90.        ],
       [   9.77777798,  100.        ],
       [  11.        ,  110.        ]])

I can probably find a workaround or split the dataset into chunks (not ideal).
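As a footnote for later readers: scikit-learn eventually grew exactly this kind of model-based imputation. The sketch below assumes a reasonably recent sklearn is installed (IterativeImputer arrived in 0.21 and is still flagged experimental); it fits a regressor on the observed rows and predicts the missing entries, so on this toy data it recovers values close to the true 3 and 10 rather than one shared mean:

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be explicitly enabled
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({'col1': [1, 2, np.nan, 4, 5, 6, 7, 8, 9, np.nan, 11],
                   'col2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110.]})

# Fits a regression model (BayesianRidge by default) on the complete
# cases and predicts each missing entry from the other columns
imputer = IterativeImputer(random_state=0)
filled = imputer.fit_transform(df)
print(filled[2], filled[9])  # roughly [3., 30.] and [10., 100.]
```

Because col1 is an exact linear function of col2 in this example, the regression-based imputations land near the true values, unlike the constant 5.889 produced above.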

answered Sep 23 '22 by sedeh